I. INTRODUCTION

With the ever increasing demand for extremely complex
integrated circuits, today's electrical engineers and
systems designers have to be knowledgeable in the design and
fabrication of Very Large Scale Integrated (VLSI) circuits.
Several approaches exist today for the design of VLSI
circuits. These approaches include the interconnection of
standard library cells, gate arrays, programmable logic
arrays, and full custom design. Full custom design is the
most time consuming and expensive of these approaches, but
generally yields a more efficient VLSI design in terms of
circuit density and speed of operation.

One methodology for full custom design that can be
easily understood and implemented by the systems designer
has been developed by Mead and Conway [Ref. 1]. This
methodology, coupled with the wide variety of computer-aided
design (CAD) tools that are available, makes it possible for
the systems designer to translate a design from a functional
block diagram, or a logic diagram, to silicon. Intelligent
simulation of the design prior to fabrication gives the
designer a high degree of confidence that the circuit
functions as desired, barring any unforeseen fabrication
errors.

Another method that is available for the generation of
VLSI circuits is the use of a silicon compiler, which takes
as input an algorithmic description of a circuit's desired
functions and generates the final layout of a VLSI circuit.
Using this approach to circuit design results in a rapid
design turn-around time. This allows the system designer
to explore different architectures and find the
method best suited to solve a specific problem. One such
compiler that is installed and running at the Naval
Postgraduate School (NPS) is the MacPitts silicon compiler
developed at Massachusetts Institute of Technology's Lincoln
Laboratory. The installation and initial research on the
MacPitts compiler is documented in work done previously by
Carlson [Ref. 2]. Carlson utilized the MacPitts silicon
compiler to generate an 8-bit unsigned pipelined multiplier
to be used in a digital filter. To provide the basis for
comparison of a full custom design and a design generated by
the MacPitts silicon compiler, a 16-bit two's complement
multiplier in three micron NMOS was hand-crafted using CAD
tools currently available at NPS.

The discussion of a general carry-save addition (CSA)
multiplier follows in Chapter 2. Chapter 3 presents the
adaptation of the CSA multiplication scheme to the 16-bit
two's complement multiplier. The remainder of Chapter 3
contains the design and testing of the multiplier and a
description of the CAD tools utilized. Chapter 4 presents a
test plan for the VLSI circuit after its fabrication by the
MOS Implementation Service (MOSIS) of the Defense Advanced
Research Projects Agency. This is followed by a comparison
of the hand-crafted and MacPitts generated multipliers in
Chapter 5.


II. UNSIGNED BINARY MULTIPLICATION

In this chapter, the implementation of an unsigned
binary parallel multiplier is described. First, a brief
discussion of the add-and-shift algorithm is presented.
Although almost every reference in digital arithmetic
contains a section on this algorithm (also called sequential
multiplication), it is given here so that terminology and
representations used in this chapter and the next may be
introduced. Next, a multiplication scheme utilizing
simultaneous generation of partial products followed by
simultaneous reduction using carry-save addition (CSA) is
described. The chapter concludes with a discussion of
implementing this parallel multiplication scheme as a
pipelined VLSI design.


A. ADD-AND-SHIFT ALGORITHM


The basis for the multiplier design presented in this
chapter is the add-and-shift algorithm, which is similar to
the way one multiplies using pencil and paper. For example,
as shown in Figure 2.1, in multiplying two binary numbers
each bit of the multiplier requires a corresponding
add-and-shift operation.

A mathematical representation of the add-and-shift
algorithm for two n-bit numbers is given in Equation 2.1. This
equation has been derived from chapter 2 of Introduction to
Computer Architecture by Stone and others [Ref. 3].


$$P = \sum_{k=0}^{n-1} 2^{k} a_{k} b$$ (eqn 2.1)


In this equation and throughout the remainder of this
thesis, concatenation implies the logical AND, the symbol +
implies the logical OR, b represents the n-bit multiplicand
vector, a_k represents bit k of the multiplier vector a, and P
represents the 2n-bit product vector. Figure 2.2 illustrates
this concept for the multiplication of two 8-bit
operands and Figure 2.3 introduces a convenient dot
representation of the same multiplication. As can be seen from
Figure 2.2, multiplying two 8-bit operands results in eight
partial products which are added to form a 16-bit final
product.

Figure 2.1 Paper and Pencil Multiplication.


[Figure 2.2: multiplicand X7-X0, multiplier Y7-Y0, eight
shifted partial products A through H, and the 16-bit final
product.]

Figure 2.2 Multiplying Two 8-bit Operands.

Figure 2.3 Dot Representation.
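The add-and-shift procedure of Equation 2.1 can be sketched in a few lines of Python. This is only an illustration of the algorithm, not part of the thesis design; the function name `add_and_shift` and its argument names are chosen here for the example.

```python
def add_and_shift(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit integers by the add-and-shift method.

    a is the multiplier, b is the multiplicand (naming follows Equation 2.1).
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must fit in n bits")
    product = 0
    for k in range(n):
        a_k = (a >> k) & 1           # bit k of the multiplier
        if a_k:
            product += b << k        # add the multiplicand shifted k places
    return product


if __name__ == "__main__":
    print(add_and_shift(0b1011, 0b1101, 4))  # 11 * 13
```

Each loop iteration corresponds to one partial product row of Figure 2.2, so an n-bit multiplier needs n sequential add-and-shift steps.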


B. SIMULTANEOUS MATRIX GENERATION AND REDUCTION


In terms of speed, the basic add-and-shift algorithm is
the slowest of the multiplication schemes. One method to
improve the speed of the basic sequential multiplier is to
perform as many operations as possible in parallel. This
method, known as the Simultaneous Matrix Generation and
Reduction method [Ref. 4: pp. 132-147], is composed of three
distinct steps. In the first step, all of the partial
products are simultaneously generated. In the next step, the
resultant matrix of partial products is reduced using
carry-save addition (CSA) until two vectors remain. Finally, the
two remaining vectors are added together to form the final
product.

1. Partial Products Generation


The simplest way to generate each bit position of
the partial products is to use the logical AND operation as
a 1x1 multiplier. For example, in Figure 2.2, each of the
terms in the eight partial products is the result of a
logical AND operation and also corresponds to a single dot
in each of the partial products of Figure 2.3. For an nxn
multiplication this scheme requires nxn AND gates, which is
a simple, but hardware intensive scheme.
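As a small illustration of the 1x1 multiplier idea, the sketch below builds the matrix of partial product bits with one AND per bit pair and then sums the bits by weight. The function names are hypothetical and the code models only the arithmetic, not the gate-level hardware.

```python
def partial_product_matrix(a: int, b: int, n: int) -> list[list[int]]:
    """Return an n x n matrix of bits; entry [k][j] is a_k AND b_j.

    Row k is the partial product for multiplier bit a_k, and the bit in
    column j of that row carries weight 2**(k + j).
    """
    return [[((a >> k) & 1) & ((b >> j) & 1) for j in range(n)]
            for k in range(n)]


def sum_partial_products(rows: list[list[int]]) -> int:
    """Add every partial product bit at its weight to form the product."""
    return sum(bit << (k + j)
               for k, row in enumerate(rows)
               for j, bit in enumerate(row))


if __name__ == "__main__":
    rows = partial_product_matrix(0xB7, 0x5C, 8)
    and_gates = sum(len(row) for row in rows)   # one AND gate per entry
    print(and_gates, sum_partial_products(rows) == 0xB7 * 0x5C)
```

For n = 8 the matrix has 64 entries, matching the nxn AND gate count stated above.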

It is possible to use encoding techniques that will
reduce the number of partial products. One such method that
reduces the number of partial products by half is the
modified Booth's algorithm. For a description of both Booth's
original and modified algorithms, the reader is referred to
two presentations of these topics [Refs. 4,5: pp. 132-137,
152-157].

Another way to generate partial products is to use
read only memories (ROMs). For example, the 8x8
multiplication of Figure 2.2 can be implemented using four 256x8 ROMs
where each ROM performs a table lookup multiplication, as
shown in Figure 2.4.

In Figure 2.4, the 4-bit value of each element of
the pairs (Y0,X0), (Y0,X1), (Y1,X0), and (Y1,X1) is
concatenated to form an 8-bit address into the ROM table. The ROM
location corresponding to the address contains a unique
8-bit product. Thus four tables are required to
simultaneously form the products Y1xX1, Y1xX0, Y0xX1, and Y0xX0.
Note that the Y0xX0 and Y1xX1 terms have disjoint
significance, thus only three terms must be added to form the final
product. The number of rearranged partial products which
must be summed is referred to as the matrix height h. This
height corresponds to the number of initial inputs to the
CSA tree. A generalization of this scheme for up to a 64x64
bit multiplication is shown in Figure 2.5. Each rectangle
in Figure 2.5 [Ref. 4: p. 138] represents a 4x4 ROM
multiplier product.
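The table lookup scheme can be modelled in software as follows. This is a minimal sketch under the assumption that each ROM is addressed by the Y nibble concatenated with the X nibble; the names `ROM` and `rom_multiply_8x8` are illustrative and do not come from the thesis.

```python
# One 256-entry table: the address is two concatenated 4-bit values and
# the stored word is their 8-bit product. In hardware four copies would
# be read in parallel; here a single list is reused for all four lookups.
ROM = [(addr >> 4) * (addr & 0xF) for addr in range(256)]


def rom_multiply_8x8(x: int, y: int) -> int:
    """Multiply two unsigned 8-bit values using 4x4 table lookups."""
    x1, x0 = x >> 4, x & 0xF
    y1, y0 = y >> 4, y & 0xF

    def lookup(y_nibble: int, x_nibble: int) -> int:
        return ROM[(y_nibble << 4) | x_nibble]

    p00 = lookup(y0, x0)            # weight 2**0
    p01 = lookup(y0, x1)            # weight 2**4
    p10 = lookup(y1, x0)            # weight 2**4
    p11 = lookup(y1, x1)            # weight 2**8

    # Y0xX0 and Y1xX1 occupy disjoint bit ranges, so they merge into one
    # 16-bit row without an adder, leaving a matrix height of three.
    merged = (p11 << 8) | p00
    return merged + (p01 << 4) + (p10 << 4)


if __name__ == "__main__":
    print(rom_multiply_8x8(200, 123) == 200 * 123)
```

Only three rows reach the final addition, which is the matrix height h = 3 described in the text.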

Table I [Ref. 4: p. 139] summarizes the maximum
height of the partial products for the three partial product
generation schemes discussed in this section.

[Figure 2.4: the operands are split into 4-bit blocks, the
ROM partial products are rearranged by weight, and the rows
are summed to form the final 16-bit product.]

Figure 2.4 An 8x8 Multiplication Using ROMs.

In the final design implemented in this thesis, the
partial products were generated using the 1x1 multiplier
(AND gate) method. This method was chosen over the other
two because of its simple and regular implementation.
Booth's algorithm was rejected as a choice due to the
complex nature of the control signals that are required.
The ROM partial product generation method was not chosen
because it would require 16 ROMs of 65536 x 16 bits to
simultaneously generate the 16 partial products needed in a
16-bit multiplier. Other possible combinations of different
size ROMs could also be used to generate the partial
products, but due to chip area and feature size limitations
imposed by MOSIS the ROM method of generating partial
products was rejected because it was not feasible to construct
on a single chip.
Figure 2.5 ROM Multiplier Weighted Position Structure.





2. Partial Products Reduction
Once the partial products are generated, the next
step is to reduce the n partial products down to two. One
technique that can be used to accomplish this is to utilize
3-input, 2-output full adders performing CSA in a Wallace
tree structure.

The partial products for the 8x8 multiplication
represented by Figure 2.3 can be viewed as adjacent columns
of height h, where each column corresponds to all terms to
the same power of 2, as shown in the Wallace tree structure
of Figure 2.6.

TABLE I
Matrix Height for Partial Product Generation Methods

Figure 2.6 Partial Products in Wallace Tree Structure.


To reduce these columns of height h, CSA is used to
reduce three dots of column height to two dots. These two
output dots, which represent the familiar sum and carry
outputs of a full adder, are placed in the next level of the
tree structure in their appropriate power positions. In
general, the number of levels (L) of CSA required
to reduce a Wallace tree structure of column height h to two
is given by Equation 2.2 [Ref. 5]. L can also be
viewed as the minimum number of full adder delays required
to produce the pair of column operands. For an 8x8
multiplication, the maximum column height is h=8. Thus, four
levels of CSA are required as illustrated in Figure 2.7
[Ref. 4: p. 141].


$$L \approx \log_{3/2}\left(\frac{h}{2}\right)$$ (eqn 2.2)


Table II [Ref. 4] shows the number of carry-save
adder levels corresponding to various column heights.
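The level counts quoted above can be checked with a short simulation. The sketch below is a simplified model that treats each partial product as a whole row and groups rows three at a time; the function names are illustrative, and the actual thesis layout reduces individual columns rather than whole rows.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Reduce three operands to a sum vector and a carry vector."""
    sum_vector = x ^ y ^ z
    carry_vector = ((x & y) | (x & z) | (y & z)) << 1  # carries move up one bit
    return sum_vector, carry_vector


def wallace_reduce(rows: list[int]) -> tuple[list[int], int]:
    """Apply levels of CSA until two rows remain; return rows and level count."""
    levels = 0
    while len(rows) > 2:
        grouped = len(rows) - len(rows) % 3
        next_rows = []
        for i in range(0, grouped, 3):
            next_rows.extend(carry_save_add(*rows[i:i + 3]))
        next_rows.extend(rows[grouped:])   # leftover rows pass straight through
        rows = next_rows
        levels += 1
    return rows, levels


if __name__ == "__main__":
    a, b, n = 0xB7, 0x5C, 8
    partial_products = [(b << k) if (a >> k) & 1 else 0 for k in range(n)]
    (row0, row1), levels = wallace_reduce(partial_products)
    print(levels, row0 + row1 == a * b)
```

With eight rows the loop runs four times, and with sixteen rows it runs six times, in agreement with the figures given for the 8x8 and 16x16 cases.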


3. Carry Look-Ahead Addition 




The final step in this multiplication scheme is to
sum the two remaining vectors created by the CSA reduction
scheme discussed in the previous section. The major
consideration in the choice of addition methods for the final
summation is speed of operation. One method that
significantly reduces the number of gate delays and increases the
speed over ripple carry addition is carry lookahead (CLA)
addition. Rather than give a full derivation of the CLA
addition concept [Ref. 5: pp. 84-91], the basic operation is
presented for the 32-bit CLA adder that is used in the final
design implemented in this thesis.

Figure 2.8 represents the designed 32-bit CLA adder
which can be thought of as operating in three steps. First,
the two input vectors X and Y to be summed are broken into
4-bit blocks. These blocks are routed into a circuit called
a block P & G generator. The block P & G generator looks at
each 4-bit block from X and Y to determine if a carry into
the least significant bit position will propagate to the
carry out of the most significant bit position of the block,
or if a carry is generated within the block itself.
The logic equations for these two signals, called block
propagate (Pn) and block generate (Gn) respectively, are
given in Equations 2.3 and 2.4 for the nth
bit position. Equations 2.3 through 2.15 are derived from
[Ref. 5: pp. 84-91].
TABLE II
Levels of CSA Needed vs. Maximum Column Height 


Column Height (h) | Number of Levels (L) 
$$P_{n} = (X_{n} + Y_{n})(X_{n-1} + Y_{n-1})(X_{n-2} + Y_{n-2})(X_{n-3} + Y_{n-3})$$ (eqn 2.3)

$$G_{n} = X_{n}Y_{n} + (X_{n} + Y_{n})X_{n-1}Y_{n-1} + (X_{n} + Y_{n})(X_{n-1} + Y_{n-1})X_{n-2}Y_{n-2} + (X_{n} + Y_{n})(X_{n-1} + Y_{n-1})(X_{n-2} + Y_{n-2})X_{n-3}Y_{n-3}$$ (eqn 2.4)


Next, the block P and G signals are input into a CLA
unit that generates the true carry Cn out of the next least
significant block C(n-1). For a 32-bit addition, two CLA
units are required. The equations for the lower order CLA
unit are given in Equations 2.5, 2.6, 2.7, and 2.8.


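$$C_{3} = G_{3} + P_{3}C_{-1}$$ (eqn 2.5)

$$C_{7} = G_{7} + P_{7}G_{3} + P_{7}P_{3}C_{-1}$$ (eqn 2.6)

$$C_{11} = G_{11} + P_{11}G_{7} + P_{11}P_{7}G_{3} + P_{11}P_{7}P_{3}C_{-1}$$ (eqn 2.7)

$$C_{15} = G_{15} + P_{15}G_{11} + P_{15}P_{11}G_{7} + P_{15}P_{11}P_{7}G_{3} + P_{15}P_{11}P_{7}P_{3}C_{-1}$$ (eqn 2.8)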


Since in a multiplication of two numbers the carry into the
least significant bit position is zero, the above four
equations reduce to Equations 2.9, 2.10, 2.11, and 2.12.


$$C_{3} = G_{3}$$ (eqn 2.9)

$$C_{7} = G_{7} + P_{7}G_{3}$$ (eqn 2.10)

$$C_{11} = G_{11} + P_{11}G_{7} + P_{11}P_{7}G_{3}$$ (eqn 2.11)

$$C_{15} = G_{15} + P_{15}G_{11} + P_{15}P_{11}G_{7} + P_{15}P_{11}P_{7}G_{3}$$ (eqn 2.12)


Similarly, the equations for the upper CLA unit are given as
Equations 2.13, 2.14, and 2.15.


$$C_{19} = G_{19} + P_{19}C_{15}$$ (eqn 2.13)

$$C_{23} = G_{23} + P_{23}G_{19} + P_{23}P_{19}C_{15}$$ (eqn 2.14)

$$C_{27} = G_{27} + P_{27}G_{23} + P_{27}P_{23}G_{19} + P_{27}P_{23}P_{19}C_{15}$$ (eqn 2.15)


Note that the carry out of the most significant bit is
disregarded. This is because the result of multiplying two
16-bit operands yields only a 32-bit result.

Finally, the carry signals generated by the previous
two steps are added in 4-bit block ripple carry adders with
their appropriate slices of X and Y to form the 32-bit sum.
Note that the carry out of each 4-bit ripple carry adder is
disregarded, as it was generated and used previously.
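The three steps of the block CLA adder can be modelled behaviourally as shown below. This is a sketch, not the PLA logic of the thesis: the block carries are computed with the recurrence C = G + P C_prev, which expands to the sum-of-products carry equations of a lookahead unit, and the function names are chosen for the example.

```python
def block_pg(x4: int, y4: int) -> tuple[int, int]:
    """Block propagate and generate for one 4-bit slice of X and Y."""
    p = [((x4 >> i) | (y4 >> i)) & 1 for i in range(4)]   # X_i + Y_i
    g = [((x4 >> i) & (y4 >> i)) & 1 for i in range(4)]   # X_i Y_i
    block_p = p[3] & p[2] & p[1] & p[0]
    block_g = (g[3]
               | (p[3] & g[2])
               | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    return block_p, block_g


def cla_add32(x: int, y: int) -> int:
    """Add two 32-bit vectors; the carry out of bit 31 is discarded."""
    blocks = [((x >> (4 * i)) & 0xF, (y >> (4 * i)) & 0xF) for i in range(8)]

    # Step 1: block P and G signals.
    pg = [block_pg(xb, yb) for xb, yb in blocks]

    # Step 2: true carry into each block; the carry into block 0 is zero.
    carries = [0]
    for block_p, block_g in pg[:-1]:
        carries.append(block_g | (block_p & carries[-1]))

    # Step 3: 4-bit adders with their carry-in; each block carry-out is
    # dropped because it was already produced in step 2.
    total = 0
    for i, ((xb, yb), carry_in) in enumerate(zip(blocks, carries)):
        total |= ((xb + yb + carry_in) & 0xF) << (4 * i)
    return total


if __name__ == "__main__":
    print(hex(cla_add32(0x0FFFFFFF, 0x00000001)))
```

Because step 2 depends only on the block P and G signals, the three steps map naturally onto separate pipeline stages.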


C. PIPELINED ADAPTATION 


In the previous section, the implementation of a
parallel CSA multiplier was described. This method can
logically be partitioned into stages for realization as a
pipelined design.

In pipelining any design or algorithm, the basic
objective is to introduce concurrency by taking the function to
be performed and partitioning it into several subfunctions.
The following properties [Ref. 6] are important to
consider when pipelining a design:

1. Evaluation of the basic function is equivalent to some
   sequential evaluation of the subfunctions.

2. The inputs for one subfunction come totally from the
   outputs of the previous subfunction in the evaluation
   sequence.

3. Other than the exchange of inputs and outputs, there
   are no interrelationships between subfunctions.

4. Hardware can be developed to execute each subfunction.

5. The times required for these hardware units to perform
   their individual evaluations are usually approximately
   equal.

The hardware required to perform each subfunction of a
pipeline is called a stage. At the output of each stage is
a latch that is used to perform the actual exchange of
operands between stages.

To partition the CSA multiplier into its stages, a
logical division of the subfunctions to be executed must be
determined. One method that initially may come to mind is
to make the partial product reduction scheme using the
Wallace tree structure as one stage of the pipeline and the
CLA addition as a second stage. This was rejected because
for a 16-bit multiply, the first stage would require six
full adder delays and an AND gate delay before being ready
to be latched. In the second stage, the CLA adder would
require the delay for the P and G generation, the true carry
generation in the CLA unit, and four full adder delays
before being ready to be latched.

The next partitioning of subfunctions went one level
further into defining each stage. The CLA adder was further
subdivided into three subfunctions. The first stage
performs the generation of the P and G signals based on the
two 32-bit input vectors. The next stage uses the P and G
signals generated in the previous stage to produce the true
carry signals. In the third and final stage of the CLA
adder, the 4-bit blocks are summed with their appropriate
carry in signals generated in the previous stage to form the
final product. In looking at the CLA adder portion, the
longest delay occurs in the final stage. This delay has a
magnitude of 4 full adder delays and it is this figure that
is used to partition the Wallace tree reduction scheme into
stages.

For a 16-bit multiplication, the maximum height of the
Wallace tree is sixteen as shown in Table I. This maximum
height requires six levels of CSA addition (see Table II)
before a column height of two is obtained to be input into
the CLA adder. Also to be performed in this stage is the
generation of each bit of the partial products through the
use of AND gates. Starting at the beginning of the Wallace
tree structure and keeping the stage delay at less than the
four full adder delays of the CLA adder, the 1x1 multiply
and three levels of CSA can be accomplished in the first
stage of the pipeline. This leaves the next stage of the
pipeline with the remaining three levels of CSA to perform
before going into the 32-bit CLA adder for the generation of
the final product. Figure 2.9 shows each stage of the
pipeline and its subfunction. This pipelined structure is to be
the one implemented in the final design of this thesis with
adaptations to allow for the implementation of a two's
complement multiplier.
Figure 2.9 Pipelined CSA Multiplier.
A. TWO'S COMPLEMENT MULTIPLIER


1. Theoretical Architecture 


The multiplication of two 16-bit signed numbers
represented in two's complement form can be performed
through the implementation of Equation 3.1 [Ref. 3] where n
equals sixteen. In Equation 3.1, the notation b' denotes
the one's complement of the multiplicand.


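$$
\begin{aligned}
P &= \sum_{k=0}^{n-2} 2^{k} a_{k} b - 2^{n-1} a_{n-1} b \\
  &= \sum_{k=0}^{n-2} 2^{k} a_{k} b + 2^{n-1} a_{n-1} (b' + 1) \\
  &= \sum_{k=0}^{n-2} 2^{k} a_{k} b + 2^{n-1} a_{n-1} b' + 2^{n-1} a_{n-1}
\end{aligned}
$$ (eqn 3.1)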


Each partial product generated through the use of Equation
3.1 is summed with the remaining partial products as in the
unsigned CSA multiplier discussed in the previous chapter
with two exceptions. First, each partial product must have
its most significant bit extended to the most significant
bit of the final product. In the design used in this thesis
for 16-bit operands, the most significant bit of each
partial product must be extended to bit position 31.
Second, the most significant bit of the multiplier must be
added into bit position 15. This insertion of the most
significant bit of the multiplier can also be accomplished
by inserting it twice into the final summation at bit
position 13 and once into each of the bit positions 14 and 15.
This is done in the final design of this multiplier to keep
the maximum column height to be input to the Wallace tree
reduction scheme at sixteen. Figure 3.1 demonstrates the
use of this equation directly on the multiplication of two
4-bit two's complement numbers where n equals four.
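Equation 3.1 can be exercised directly in software. The sketch below is an arithmetic model only, with hypothetical function names; it forms the sign-extended partial products, the one's complement term for the multiplier sign bit, and the extra insertion of that sign bit at position n-1.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def twos_complement_multiply(a: int, b: int, n: int) -> int:
    """Multiply n-bit two's complement patterns a (multiplier) and b
    (multiplicand); return the 2n-bit two's complement product pattern."""
    b_value = sign_extend(b, n)        # multiplicand, sign extended
    b_ones = sign_extend(~b, n)        # b', its one's complement, sign extended
    total = 0
    for k in range(n - 1):             # terms for a_0 through a_(n-2)
        if (a >> k) & 1:
            total += b_value << k
    if (a >> (n - 1)) & 1:             # terms controlled by the sign bit
        total += b_ones << (n - 1)     # 2^(n-1) a_(n-1) b'
        total += 1 << (n - 1)          # 2^(n-1) a_(n-1)
    return total & ((1 << (2 * n)) - 1)


if __name__ == "__main__":
    # +7 times -5 with n = 4
    print(format(twos_complement_multiply(0b1011, 0b0111, 4), "08b"))
```

Running the four operand pairs (+7, +5), (+7, -5), (-7, +5), and (-7, -5) through this function gives the 8-bit patterns for +35 and -35.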





      0111 = +7                0111 = +7
    x 0101 = +5              x 1011 = -5
    --------                 --------
    00000111                 00000111
    0000000                  0000111
    000111                   000000
    00000                    11000
    00000                    00001
    --------                 --------
    00100011 = +35           11011101 = -35

      1001 = -7                1001 = -7
    x 0101 = +5              x 1011 = -5
    --------                 --------
    11111001                 11111001
    0000000                  1111001
    111001                   000000
    00000                    00110
    00000                    00001
    --------                 --------
    11011101 = -35           00100011 = +35

Figure 3.1 Two's Complement Multiplication.

Figure 3.2 Input to Wallace Tree Reduction Method.
Figure 3.2 shows, in dot notation, the partial
products generated with 1x1 multipliers using Equation 3.1 with
the two exceptions discussed above for a 16-bit two's
complement multiplication. It is this structure that is
input into the Wallace tree reduction scheme to be reduced
to a final maximum column height of two. Since the maximum
column height is sixteen for the 16-bit two's complement
multiplication presented in this thesis, six levels of CSA,
as shown in Figures 3.3 and 3.4, are required to decompose
this structure to a maximum column height of two. The
resulting two vectors generated by the CSA are then input
into the CLA adder presented in the previous chapter.

One interesting point to note is that the column
height for certain columns is only one. This is caused when
CSA is performed on three or fewer operands in a column and
no carry into that column is produced by the next lower
significant one. In these operand vectors, a zero is input
for the appropriate bit position into the CLA adder.

To perform this multiplication in a pipelined
manner, latches must be inserted at the end of each stage of
the pipeline as discussed earlier. Since the first stage
involves a 1x1 multiplication to generate the partial
products and three levels of CSA, the first latch must be
inserted at the end of the third level of CSA. At this
point, 143 bits of data must be transferred to the second
stage. Therefore, the first latch is 143 bits wide.
Similarly, the second stage ends after the sixth level of
CSA is performed. This requires the second latch to be 57
bits wide. These 57 bits are then input to the CLA adder.
The third stage of the circuit generates the block P and G
signals. These signals and the 57 bits of the two CLA
operands are then transferred to the fourth stage in a 70 bit
wide latch. The fourth stage uses the P and G signals to
generate the true carry signals to be used in the fifth and
final stage. This requires a 64 bit latch at its output to
hold the carry signals and the two CLA operand vectors. The
final product appears at the output of the fifth stage and
is stored in a 32 bit wide latch so that latched outputs can
be provided to any subsequent circuits that this multiplier
may drive.

Figure 3.3 Partial Product Reduction Using CSA.

Figure 3.4 Partial Product Reduction Using CSA (cont'd.).


2. Actual Implementation





The initial floorplan for the circuit is shown
in Figure 3.5. This floorplan closely follows the theoretical
implementation with two exceptions.

First, in a VLSI design, an AND gate used as a 1x1
multiplier is implemented with a NAND gate followed by an
inverter. This active-high signal is then input to an
active-high input, active-high output full adder in the
first level of CSA. Rather than construct these two circuit
elements in this manner, the actual implementation utilized
a NAND gate as the 1x1 multiplier driving an active-low
input, active-high output full adder. Any signal generated
with a NAND gate as a partial product bit that is not used
in the first level of CSA is simply routed through an
inverter to convert it to an active-high signal for use in
subsequent levels of CSA. This provided a reduction of 256
in the number of inverters to be constructed.

Second, the sign bits of each of the partial
products must be extended to bit position thirty-one. These
extended bits must also be added in the Wallace tree
reduction of the partial products. When these sign bits are
grouped for input to a full adder in the first level, up to
fourteen adders have the same three inputs. Rather than
duplicate the adders, which would increase power consumption
and usage of chip area, only one adder was used to calculate
the sum and carry inputs to the next level of CSA. These
high fanout sum and carry inputs are then superbuffered to
drive the second level of CSA. This resulted in a savings
of thirty-five full adders not having to be implemented in
silicon.

Figure 3.5 Initial Floorplan.

The clocking of the circuit is accomplished by a
non-overlapping two-phase clock. Both phases are input to
the circuit through separate input pads. An additional
signal called OP is provided to allow for the implementation
of a level sensitive scan design (LSSD) [Ref. 7]. In a
LSSD, the contents of the latches are either loaded in
parallel when OP is high or serially shifted to an output
pad and serially loaded from an input pad when OP is low.
This allows the contents of each of the first four latches
to be examined to aid in the detection of fabrication errors
or circuit malfunctions. The output latch is not serially
loaded or shifted to an output pad because its contents are
directly available at the output pads.
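The two latch modes selected by OP can be illustrated with a small behavioural model. This is a sketch under simplifying assumptions: one method call stands for one full two-phase clock cycle, and the class and argument names are invented for the example rather than taken from the design.

```python
class ScanLatch:
    """Behavioural model of a pipeline latch with a serial scan path."""

    def __init__(self, width: int) -> None:
        self.bits = [0] * width

    def clock(self, op: int, parallel_in=None, serial_in: int = 0):
        """Advance one clock cycle.

        op high: load all bits in parallel from the data path.
        op low:  shift one position, taking serial_in from the input pad
                 and returning the bit that leaves toward the output pad.
        """
        if op:
            if parallel_in is None or len(parallel_in) != len(self.bits):
                raise ValueError("parallel load needs one bit per latch cell")
            self.bits = list(parallel_in)
            return None
        shifted_out = self.bits[-1]
        self.bits = [serial_in] + self.bits[:-1]
        return shifted_out


if __name__ == "__main__":
    latch = ScanLatch(4)
    latch.clock(op=1, parallel_in=[1, 0, 1, 1])      # normal operation
    observed = [latch.clock(op=0) for _ in range(4)]  # scan the contents out
    print(observed)
```

Scanning a latch out with OP low exposes the intermediate values of a stage, which is what makes the first four latches observable during testing.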


B. DESIGN TOOLS


Before the actual layout of a VLSI circuit can be
undertaken, certain CAD tools are needed by the designer. First,
a graphical layout editor is required to allow the designer
to construct a VLSI circuit. Second, to allow for the
implementation of complex logic functions, a PLA generator
is desired. Next, the ability to employ a design rule
checker on a layout is essential to insure that design rule
violations do not unintentionally occur. Finally, tools
that perform circuit simulation for logic, timing, and power
consumption are useful in determining the proper operation
of the designed circuit.

In the design of the 16-bit pipelined multiplier, the
CAESAR layout editor [Refs. 7,8] was used as the basis for
the layout of the entire chip. To facilitate the design of
complex logic functions, EQNTOTT [Ref. 9] and TPLA [Ref. 9]
were employed to construct complex programmed logic arrays
(PLAs). LYRA [Ref. 9] was used to perform design rule
checks on the circuit. Circuit simulation for logic,
timing, and power were performed by ESIM [Refs. 2,9],
CRYSTAL [Refs. 10,11] and POWEST [Ref. 3] after a circuit
extraction was performed using MEXTRA [Ref. 9].

The manuals for each of the CAD tools discussed above
are available on the NPS Computer Science Department's UNIX
operating system. To obtain an on-line copy of the manual
for a specific design tool, issue the command

     % cadman <design tool name>

To obtain a hardcopy of a certain CAD tool manual, issue the
command

     % cadman <design tool name> | lpr

This command will send a copy of the normal CAD manual to
the line printer.
1. EQNTOTT


EQNTOTT is a program which generates a truth table
suitable for input to TPLA from a set of Boolean equations
which define the PLA outputs in terms of its inputs. The
equation syntax is

     NAME = EXPRESSION;

where NAME is the output variable name and EXPRESSION is a
Boolean equation in sum of products (SOP) form that
represents the output variable in terms of its inputs. In the
SOP expression, the & symbol denotes the logical AND, the |
symbol denotes the logical OR, and the ! symbol preceding
an operand denotes the logical inversion. The input and
output signal order, from left to right or top to bottom, as
appropriate, can be controlled with the INORDER and OUTORDER
commands.
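The step from Boolean equations to a truth table can be illustrated with a short Python sketch. It is not EQNTOTT and does not reproduce its file format; it only enumerates the input combinations for a set of sum of products equations, using a full adder as a hypothetical example with inputs named a, b, and c.

```python
from itertools import product


def NOT(bit: int) -> int:
    """Logical inversion of a single bit (the ! operator in the text)."""
    return 1 - bit


def truth_table(input_names, equations):
    """Evaluate every equation for every input combination.

    equations maps an output name to a function of a dict of input bits.
    Returns a list of (input_bits, output_bits) tuples in counting order.
    """
    rows = []
    for values in product((0, 1), repeat=len(input_names)):
        env = dict(zip(input_names, values))
        outputs = tuple(equation(env) & 1 for equation in equations.values())
        rows.append((values, outputs))
    return rows


# Sum of products equations for a full adder, written with & and |.
FULL_ADDER = {
    "SUM": lambda v: (v["a"] & NOT(v["b"]) & NOT(v["c"]))
                     | (NOT(v["a"]) & v["b"] & NOT(v["c"]))
                     | (NOT(v["a"]) & NOT(v["b"]) & v["c"])
                     | (v["a"] & v["b"] & v["c"]),
    "CARRY": lambda v: (v["a"] & v["b"]) | (v["a"] & v["c"]) | (v["b"] & v["c"]),
}

if __name__ == "__main__":
    for inputs, outputs in truth_table(["a", "b", "c"], FULL_ADDER):
        print(*inputs, "|", *outputs)
```

The order of the names passed in and the order of the dictionary keys play the role that INORDER and OUTORDER play for the real tool.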
2. TPLA


TPLA is a technology independent PLA generator that
supports design rules in the following styles:

1. Mead-Conway NMOS with butting contacts, no buried
   contacts.

2. Mead-Conway NMOS with buried contacts, no butting
   contacts.

3. MOSIS 3 micron bulk CMOS.


It takes as its input the output of EQNTOTT and generates a
PLA layout in the desired technology. The default output
option is a CAESAR file. TPLA can provide inputs and
outputs on either the same side (cis version) or on opposite
sides (trans version) of the generated PLA. In addition,
clocked inputs and/or outputs can be supported by TPLA
through another option selection.
3. LYRA


LYRA is a design rule checker that operates on
graphical files in CAESAR format. It can be invoked either
interactively while editing a CAESAR file or on a CAESAR
file and run in the background on the UNIX operating system.
The interactive mode is discussed in earlier work done by
Reid [Ref. 7]. In the background mode, LYRA is invoked by
executing the command

     % lyra filename.ca &

This generates a file named CHECKPT which contains the names
of all subcells of the design being checked that have
completed a design rule check. If an error is found in the
parent cell or any of its subcells, a file with the same
name and a filetype of .ly is output to the user's current
working directory. This file contains all error information and can
be edited using CAESAR to view the errors for further
correction. This mode of operation for LYRA provides an
excellent means for design rule checking large designs that
normally would take a long time in the interactive mode.


C. LAYOUT 


Once the designer has determined the architecture to be
implemented and the initial floorplan, and has mastered the CAD
tools that are available, the next step in the design cycle
is to begin the layout of the actual circuit. One technique
that is utilized in this design of a 16-bit pipelined
multiplier is a form of the hierarchical design method. In this
method, once the above three items are completed, the
architecture is examined to look for some basic building blocks
that could be designed and used repeatedly in the
construction of the circuit. Upon examination of the architecture
for the 16-bit pipelined multiplier, the four basic circuit
elements that can be designed and iterated throughout the
circuit are a full adder, a 4-bit block P and G generator, a
CLA unit, and a 1-bit latch cell.

The full adder is the main element in both of the first
two stages in the pipeline as well as a basic building block
for the 4-bit ripple carry adders in the fifth stage. The
first two methods of implementation that immediately arise
are constructing an adder by using either discrete gates or
a PLA generator such as TPLA. A third method [Ref. 12] that
is possible is to use pass transistors in a selector logic
circuit to generate the sum and carry bits that are
conditioned on the three input bits to be added.

In choosing the adder to be implemented, the two main
considerations are its speed and power consumption. Both the
discrete gate and the PLA adders have a higher static power
consumption than the selector adder because they contain
more depletion pull-up transistors than the selector adder.
After simulation of these circuits for speed using CRYSTAL,
it was found that the selector circuit, with a 14.7
nanosecond propagation delay, was faster than both of the
other two by at least two nanoseconds. Therefore, the
selector adder was chosen as one of the basic building
blocks of the circuit. Figure 3.6 shows a circuit diagram of
the selector adder used in the design of the 16-bit
multiplier. Two minor drawbacks exist to the selection of
this type of adder. When the output of one adder drives the
input of another, this is equivalent to the output of a pass
transistor driving an inverter. To ensure that the following
adder inputs are driven to the necessary voltage levels to
operate properly, the input inverters to each vertical
selector rail must have a pull-up to pull-down ratio of
eight. Also, the selector rail that provides the true signal
to the circuit must pass through two inverters. This
prevents the output of a pass transistor in the previous
adder from directly driving the gate of a pass transistor in
the current adder [Ref. 1].










Figure 3.6 Selector Adder Circuit Diagram.

Both the 4-bit block P and G generator and the CLA unit
are complex logic functions well-suited for implementation
as PLAs. These two circuit elements are implemented by
inputting Equations 2.3 and 2.4 (for the P and G generator)
and Equations 2.9 to 2.15 (for the CLA unit) into EQNTOTT.
The output of EQNTOTT is then piped to TPLA to generate the
actual CAESAR files for the PLAs. Since data flows into one
side and out from the opposite side of each stage, the trans
version of the PLAs was constructed.

The last building block of the circuit to be designed is
the 1-bit latch cell. Since LSSD was an important criterion
for designing the 16-bit multiplier, the 1-bit latch cell
must be able to be loaded either in parallel along the data
path or in serial from an adjacent latch cell. This function
is under control of the OP signal.

To minimize the area consumed by the latch, a dynamic
latch composed of a pair of inverters coupled by pass
transistors was selected. As in the adder circuit, a pull-up
to pull-down ratio of eight is needed for the inverters
because they are driven by pass transistors. Figure 3.7
shows the circuit diagram of the 1-bit latch cell as
implemented. The operation of the latch cell is as follows.
For normal operation (OP=1), the NORMAL signal is high and
the SHIFT signal is low during PHI1. Data appearing at the
DATA IN port drives the first inverter. When PHI1 falls, the
gate of the first inverter retains the logic value of DATA
IN in its gate capacitance. When PHI2 rises, this data
drives the second inverter which effectively transfers the
data to DATA OUT and the next stage. For a shifting
operation (OP=0), the NORMAL signal is low and the SHIFT
signal is high. Data appearing at the LATCH IN port, which
connects to DATA OUT of the next latch cell to the left,
charges the gate capacitance of the first inverter. The pass
transistor transfers the data to the second inverter on PHI2
as in a normal operation. This effectively shifts the data
from the LATCH IN port to the LATCH OUT port in one cycle of
the clock. Figure 3.8 shows the circuitry to condition PHI1
with OP to generate the NORMAL and SHIFT signals used above.

Figure 3.7 1-bit Latch Cell.
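
The latch operation described above can be summarized in a
small behavioural model. This is only a sketch under stated
assumptions: the gating NORMAL = PHI1 AND OP and SHIFT = PHI1
AND (NOT OP) is inferred from the description rather than taken
from the circuit diagram, the two inversions are treated as
cancelling, and the names control_signals and LatchCell are
hypothetical.

```python
def control_signals(phi1, op):
    """Assumed gating: NORMAL = PHI1 AND OP, SHIFT = PHI1 AND (NOT OP)."""
    return phi1 & op, phi1 & (1 - op)


class LatchCell:
    """Behavioural model of the dynamic latch; the two inversions cancel."""

    def __init__(self):
        self.first = 0  # value held on the first inverter's gate capacitance
        self.out = 0    # value presented at DATA OUT / LATCH OUT

    def phase1(self, op, data_in, latch_in):
        """PHI1 high: load from the data path or from the adjacent cell."""
        normal, shift = control_signals(1, op)
        if normal:
            self.first = data_in
        elif shift:
            self.first = latch_in

    def phase2(self):
        """PHI2 high: the stored value is passed on to the second inverter."""
        self.out = self.first


cell = LatchCell()
cell.phase1(op=1, data_in=1, latch_in=0)  # normal (parallel) load
cell.phase2()
print("after normal load:", cell.out)
cell.phase1(op=0, data_in=1, latch_in=0)  # serial shift from the neighbour
cell.phase2()
print("after shift:", cell.out)
```

Chaining several such cells, with each LATCH IN fed from the
neighbouring cell's output, reproduces the serial scan path
used for LSSD.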


Figure 3.8 Generation of the Control Signals.

Once these four basic building blocks are designed, each
stage of the pipeline and its latch is developed out of the
appropriate subcells. Next, the internal routing of signals
within a stage is accomplished through the use of a wire
list. Then the five stages of the circuit are wired
together to form the core of the design. Finally, all that
remains to be done is to connect this core design to a frame
to allow adequate interfacing for the packaging process.

This routing of signals both within the core of the
design and to the frame is an extremely time consuming task
that requires as much time, effort, and planning as the
design and layout of all the major components. An automatic
router would be a welcome addition to any designer's CAD
toolbag.

The design frame is composed of a pad set that was
obtained from MOSIS. These pads were specifically designed
for fabrication at 1.5 microns per lambda. A copy of these
pads is located in the file

     /vlsi/berk83/lib/pads15.cif

and associated documentation can be found in the file

     /vlsi/berk82/doc/pads15.

Both of these files are located on the NPS Computer Science
Department's VAX 11/780 running the UNIX operating system.

Numerous repetitions of the design, design rule check,
and redesign cycle occurred before a final design was
obtained. Using LYRA for the design rule check on a large
design such as the 16-bit multiplier requires approximately
1000 CPU minutes. When the UNIX system is heavily loaded,
this results in a turn-around time on the order of two to
three days. Figure 3.9 depicts the final design of the
entire chip. Each of the six levels of CSA are shown as
level1 through level6. The latches are labelled latchxx
where xx is the appropriate number of bits in the latch. The
block P and G generators are designated PG and the CLA unit
is simply shown as CLA. The 4-bit ripple carry adders are
shown as ADD. Three blocks not previously discussed are
labelled AMP. These are control line drivers that drive the
high fanout NORMAL, SHIFT, and PHI2 signals to each of the
latches. These drivers are composed of the same circuitry
used by the output pads to drive off chip loads.

Figure 3.9 Final Chip Floorplan.


The actual plots of each of the four building blocks and
the final circuit layout are contained in Appendix A. These
plots were generated using the program CIFPLOT [Ref. 9].


D. DESIGN VALIDATION


The next step in the design cycle is to functionally
validate the chip's operation before it is sent to MOSIS for
fabrication. This will give the designer a high degree of
certainty that the chip operates logically as desired with
an approximate power consumption and at a certain maximum
frequency of operation.

Before these three items can be verified, two
preliminary steps must be completed. First, the CAESAR
file must be edited to label the nodes and a Caltech
Intermediate Format (CIF) file generated. For the purpose
of performing design validation using CAD tools, the scale
of centimicrons per lambda must be an even multiple of four.
This prevents round-off errors in the resultant CIF file.
Since the final design is to be fabricated at lambda equals
1.50 microns, 152 centimicrons per lambda is used. Second,
the CIF file must be passed through the MEXTRA program using
the command

     % mextra -o filename.cif &

so that a node extraction is performed on the circuit. On
large files, it is extremely useful to run this program in
the background mode as shown by the & in this command. A
large CIF file such as the one for the 16-bit multiplier can
take up to thirty minutes of CPU time to run. When the UNIX
system is heavily loaded, this requires eight to ten hours
of real time. The output files are directly compatible with
the CAD simulation tools to be used.
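
The choice of 152 centimicrons per lambda can be checked with
a short calculation. The sketch assumes that the scale is
rounded up to the next multiple of four; the text states only
that a multiple of four is required and that 152 was used. The
function name cif_scale is hypothetical.

```python
def cif_scale(lambda_microns, multiple=4):
    """Smallest multiple of `multiple` centimicrons that is >= lambda."""
    centimicrons = round(lambda_microns * 100)
    # Ceiling division written with integer arithmetic only.
    return -(-centimicrons // multiple) * multiple


print(cif_scale(1.50))
```

A lambda of 1.50 microns corresponds to 150 centimicrons,
which is not divisible by four, so the next admissible value
above it is used.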


1. Logical Simulation

The first step in any design validation process is
to determine if the circuit functions as it was designed to.
Today, as the complexity of VLSI designs increases, the
number of possible inputs goes up tremendously. For
example, to exhaustively test just the normal operation of
the 16-bit multiplier would require each possible
combination of the 16-bit multiplier and multiplicand
inputs. The number of possible combinations of the vectors
a and b is

     (2^16)^2 = 2^32 = 4,294,967,296.


The ESIM logic simulator is the CAD tool to be used
for checking operation of the 16-bit multiplier. If a
vector pair is input only once, without regard to order, and
at an estimated rate of two test vector pairs simulated per
minute, this would require

     4,294,967,296 vectors x 1 day/2880 tests = 1.49 x 10^6 days.

This amounts to over 4085 years required to perform an
exhaustive test.
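
The arithmetic behind this estimate can be reproduced in a few
lines. The only inputs are the figures quoted in the text (two
vector pairs per minute, every combination of two 16-bit
operands); a 365 day year is assumed for the conversion.

```python
pairs = (2 ** 16) ** 2        # every multiplier/multiplicand combination
tests_per_day = 2 * 60 * 24   # two vector pairs simulated per minute
days = pairs / tests_per_day
years = days / 365            # assumed 365 day year

print(f"{pairs:,} pairs, {days:.3g} days, {years:.1f} years")
```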

Therefore, seven representative pairs of test
vectors were selected for simulation to determine if the
circuit operates correctly. Exhaustive testing is not
possible, but most possible errors would be revealed by
these few, carefully chosen test vectors. These seven test
vectors are:

     1. +743 x +27
     2. -743 x +27
     3. +743 x -27
     4. -743 x -27
     5. +1123 x +891
     6. -1123 x +891
     7. -32768 x -32768


These vectors were designed to test as large a number of
subcircuits as possible. The first four vector pairs test
the basic architecture for the correct implementation of the
algorithm represented by Equation 2.18. The positive/
negative and negative/negative test vector pairs also test
the CLA adder's ability to produce a proper sum over the
entire thirty-two bit width. The next two vector pairs test
the ability of the CSA in the Wallace tree reduction scheme
to produce a correct result in the upper sixteen bits of the
product. The last test vector is the largest negative
number representable in 16-bit two's complement form.
Further simulation with additional test vectors would
increase the confidence of the designer in the ability of
the circuit to properly perform a 16-bit two's complement
multiplication prior to fabrication.
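
When checking simulator output against such test vectors it
is convenient to have the expected 32-bit result split into
its upper and lower halves. The sketch below is one way to
compute this; the function name expected_outputs is
hypothetical, and the three operand pairs are used only as
examples.

```python
def expected_outputs(a, b):
    """Return (upper 16 bits, lower 16 bits) of the 32-bit two's
    complement product a * b."""
    for operand in (a, b):
        if not -2**15 <= operand < 2**15:
            raise ValueError("operand does not fit in 16-bit two's complement")
    product = (a * b) & 0xFFFFFFFF  # wrap the signed product into 32 bits
    return product >> 16, product & 0xFFFF


for a, b in [(743, 27), (-1123, 891), (-32768, -32768)]:
    high, low = expected_outputs(a, b)
    print(f"{a:7d} x {b:7d} -> high={high:016b} low={low:016b}")
```

A negative product shows up as a run of leading ones in the
upper half, which is why mixed-sign vectors exercise the full
thirty-two bit width of the final adder.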

Once the read-in of the .sim file by ESIM is
completed, the initialization of the circuit, the defining
of watched nodes, and describing the clock cycles must be
accomplished before any simulation is performed. Rather
than do this each time ESIM is entered, a macro file was
created that is called at the beginning of each session.
This file is called init_esim and is shown in Figure 3.10
for the 16-bit multiplier. The input vectors for the two
operands are represented as ain and bin. The resultant
product vector is shown as phigh and plow representing the
upper and lower 16 bits of the 32-bit product, respectively.
The latch input and output signals are represented as the
vectors latchin and latchout where the leftmost bit
corresponds to the first latch and the rightmost bit to the
fourth latch.

After initialization of the circuit by executing the
init_esim macro, at each clock cycle the seven test vector
pairs previously defined are input in sequential order. In
each case, on the fifth clock cycle after introduction of a
test vector, the correct product appeared at the output pads
phigh and plow. This demonstrates that the circuit can
properly multiply two 16-bit two's complement operands to
yield a 32-bit result with the result dependent only on the
inputs to the circuit five clock cycles prior. The results
of this logic simulation are contained in Appendix B.

W ain a15 a14 a13 a12 a11 a10 a9 a8 a7 a6 a5 a4 a3 a2 a1 a0

W bin b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0

W latchin l1_in l2_in l3_in l4_in

W phigh p31 p30 p29 p28 p27 p26 p25 p24 p23 p22 p21 p20 p19 p18 p17 p16
W plow p15 p14 p13 p12 p11 p10 p9 p8 p7 p6 p5 p4 p3 p2 p1 p0

W latchout l1_out l2_out l3_out l4_out

K phi1 01000 phi2 00010


Figure 3.10 Initialization Macro for ESIM.


The serial shifting of the latches was simulated and
used to generate the intermediate results discussed in the
next chapter. This also proved to logically operate as
expected, thus giving the designer a high degree of
confidence that the circuit operates as desired.
2. Timing

The CRYSTAL VLSI timing analyzer is used to test for
the worst case propagation delay in the circuit. Each phase
of the clock in both a normal and shifting operation is
checked for a critical path that is defined to be within one
percent of the worst case propagation delay. These critical
paths determine the maximum clock speed at which the circuit
can properly operate. The worst delays found are discussed
for each phase of the clock.

On the rising edge of an externally applied phi1,
the longest propagation delay occurs from the input pads
until the data is stored in the first inverter of the stage
1 latch. This delay is found to be 558.82 nanoseconds.
This long delay can be attributed to the two high fanouts
that occur in the data path of the first stage. The first
is a fanout of sixteen that occurs at each input pad to the
input of the sixteen NAND gates used as 1x1 multipliers.
The second is a fanout of fourteen that occurs at the end of
the first stage where the full adder cells that correspond
to the extended sign bits are distributed to drive full
adders in the second stage.

When phi1 falls, it takes 89.11 nanoseconds for the
latch cells to turn off their input pass transistors and
isolate the data so it may be transferred during phi2. This
fall time corresponds to the separation time between phi1
and phi2 when both clock phases are low.

Once a rising clock edge is applied to phi2, it
takes 98.26 nanoseconds for the pass transistors in the
latch cells to turn on and charge the second inverter. To
complete the transfer of data, these pass transistors must
be disabled by the falling of phi2. This corresponds to the
minimum separation between the phi2 and phi1 clock phases
and is found to be 64.28 nanoseconds.

Figure 3.11 depicts the minimum clock cycle for the
16-bit multiplier as determined by CRYSTAL. This equates to
a maximum overall clock frequency of 1.234 MHz. The results
of the CRYSTAL timing analysis are contained in Appendix B.
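
The maximum clock frequency follows directly from the four
delays reported above, since one full cycle is the sum of the
two phase widths and the two separation times. The variable
names in this sketch are illustrative only.

```python
# CRYSTAL delays in nanoseconds, as reported in this section.
phi1_high = 558.82     # input pads to first inverter of the stage 1 latch
phi1_to_phi2 = 89.11   # phi1 falls, input pass transistors turn off
phi2_high = 98.26      # second inverter charged through pass transistors
phi2_to_phi1 = 64.28   # phi2 falls, transfer completed

period_ns = phi1_high + phi1_to_phi2 + phi2_high + phi2_to_phi1
frequency_mhz = 1000.0 / period_ns

print(f"period = {period_ns:.2f} ns, f_max = {frequency_mhz:.3f} MHz")
```

The phi1 width dominates the cycle, so any reduction of the
first stage fanout delay would translate almost directly into
a higher clock rate.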


Figure 3.11 Minimum Clock Cycle Parameters.


3. Power Consumption

DC power requirements for the 16-bit multiplier are
determined through the use of the CAD program POWEST.
POWEST looks for pullup transistors and determines a total
count of these devices. Using a reference power consumption
for pullup transistors of certain sizes and types, it
obtains a maximum estimate of power consumed by assuming all
pullups are on at the same time. The average power
consumption is determined by assuming that only half of the
pullups are on at a given time.

For the 16-bit multiplier, the maximum DC power
consumption is found to be 3.177 Watts with an average power
consumed of 1.983 Watts. The results of the POWEST
simulation are found in Appendix B.


IV. TEST PLAN


As stated earlier, the use of the logic simulator ESIM,
the CRYSTAL timing analyzer, and POWEST will give the
designer a high degree of confidence that the circuit
designed will perform as desired. Once the circuit has been
fabricated and received from MOSIS, it must be tested to
ensure that fabrication and/or bonding errors did not occur.
Preliminary work done by Carlson on a 16-bit pipelined
multiplier indicates that errors in fabrication and/or
bonding do actually occur. In this chapter, a test plan for
the verification of power consumption, correct logical
operation, and maximum speed of operation is presented.


A. IDENTIFYING INPUT AND OUTPUT PINS


After fabrication, the chip will come back packaged in
an 84 pin square grid package with 21 pins on each side.
Since only 77 pins are used in the 16-bit multiplier, it is
imperative that the pin to pad connections are accurately
known. To do this, one must properly orient the chip.
Close examination of the chip will reveal the logo "GO ARMY"
located between the GND and Vdd rails that run around the
perimeter of the chip. Place this logo in the southeast
corner as shown in Figure 4.1. Using this logo as a
landmark, proceed clockwise around the chip starting on the
southern edge.

Along the southern edge are twenty-one output pads that
are used for a portion of the product. Representing the
product as p31...p0 where p0 is the least significant bit,
the southern edge contains signals p6 through p26 as one
moves from east to west. The western edge is made up of
five output pads and twelve input pads. Moving from south
to north, the first five pads are p27 through p31. The next
pad is the phi2 clock input followed by the four latch
serial inputs for latch 4 through latch 1. Then comes the
Vdd pad followed by the six most significant bits of the
multiplier a15 through a10. Moving west to east along the
northern edge, the remainder of the multiplier inputs a9
through a0 and the eleven inputs of the multiplicand b15
through b5 are encountered. Along the eastern edge going
from north to south, the remainder of the multiplicand pads
b4 through b0 are found followed by the GND pad. Next are
the four latch serial outputs for latch 1 through latch 4.
Next are the OP and phi1 inputs which are followed by the
lower six bits of the product vector p0 through p5. This
should complete the circuit around the chip and leave one
back at the logo. Extreme care must be exercised when
tracing the fine wires from the bonding pads to the pins,
especially along the east and west edges where the number of
pins is greater than the number of bonding pads.

Figure 4.1
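
A pad walk of this kind is easy to get wrong on the bench, so
it can help to write the sequence down in a form that can be
checked mechanically. The sketch below lists the pads
clockwise from the east end of the southern edge. The signal
names (for example latch4_in) are hypothetical labels, and the
northern edge is assumed to carry a9 through a0, the multiplier
bits remaining after a15 through a10.

```python
def pad_ring():
    """Pads in clockwise order, starting at the east end of the south edge."""
    south = [f"p{i}" for i in range(6, 27)]                   # east to west
    west = ([f"p{i}" for i in range(27, 32)] + ["phi2"]
            + [f"latch{i}_in" for i in (4, 3, 2, 1)] + ["Vdd"]
            + [f"a{i}" for i in range(15, 9, -1)])            # south to north
    north = ([f"a{i}" for i in range(9, -1, -1)]
             + [f"b{i}" for i in range(15, 4, -1)])           # west to east
    east = ([f"b{i}" for i in range(4, -1, -1)] + ["GND"]
            + [f"latch{i}_out" for i in (1, 2, 3, 4)] + ["OP", "phi1"]
            + [f"p{i}" for i in range(0, 6)])                 # north to south
    return {"south": south, "west": west, "north": north, "east": east}


ring = pad_ring()
for edge, pads in ring.items():
    print(edge, len(pads), pads[0], "...", pads[-1])
print("total pads:", sum(len(pads) for pads in ring.values()))
```

The per-edge counts make it clear which sides of the package
have unused pins, which is where tracing the bond wires needs
the most care.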

To power the chip, +5 volts DC should be applied to the
Vdd pad and 0 volts to the GND pad. All inputs should use
Vdd to represent a logic 1 and GND for a logic 0. The
outputs use the same levels as the inputs to represent the
two logic levels. To measure the outputs, they should be
connected to a device with a high input impedance.
According to the documentation for the pads, the output pads
are designed to drive approximately two TTL loads, but may
require a pullup resistor to obtain a full Vdd output level.


B. POWER CONSUMPTION


The simplest of the three tests to perform is to check
the static (DC) power consumption of the circuit. Once
input, output, and supply pins are properly connected, this
can be accomplished by inserting a milliammeter into the Vdd
supply line and measuring the number of amperes the circuit
is drawing. This value multiplied by the +5 volts of the
power supply will give an approximate average DC power
consumption. This figure should be in the vicinity of the
1.983 Watts predicted by POWEST.
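
Before making the measurement it is useful to know what meter
reading to expect. The sketch below converts the POWEST
predictions into supply currents at +5 volts using P = V x I;
the function names are hypothetical.

```python
VDD_VOLTS = 5.0


def dc_power_watts(supply_current_amps, vdd=VDD_VOLTS):
    """Static power drawn from the supply: P = V * I."""
    return vdd * supply_current_amps


def expected_current_ma(predicted_watts, vdd=VDD_VOLTS):
    """Supply current in milliamperes for a predicted power figure."""
    return 1000.0 * predicted_watts / vdd


print(f"{expected_current_ma(1.983):.1f} mA for the POWEST average")
print(f"{expected_current_ma(3.177):.1f} mA for the POWEST maximum")
```

A reading well outside the range bounded by these two values
would suggest a fabrication or bonding fault rather than a
modelling error.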


C. TESTING FOR LOGICAL OPERATION 


Since exhaustive testing of the 16-bit multiplier is
virtually impossible, the same seven test vectors that were
used in ESIM should be utilized to verify correct operation.
In addition, other random vector pairs should be tested for
correct operation in the circuit. At this point, speed of
operation is not a concern and the clock frequency should be
reduced by a factor of approximately ten from that
predicted by CRYSTAL. This will ensure that propagation
delays do not become a factor in determining logical
correctness.

First, the vector pairs should be applied one at a time
and a minimum of five clock cycles completed with OP at a
logic 1. At the end of the fifth clock cycle, the output
should represent the correct product for the input pair.
This will at least ensure that the chip performs a 16-bit
two's complement multiplication. This should be done for
each of the seven test vector pairs that were used in ESIM.
Next, the seven test vector pairs should be applied one per
clock cycle. After a delay of five clock cycles, the
correct results should appear at the output during phi2 of
each cycle of the clock. This establishes the fact that the
chip can multiply in a pipelined manner.
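
The expected timing of the pipelined test can be laid out in
advance with a simple model of a five cycle latency. This is a
behavioural sketch only: the name run_pipeline is hypothetical
and the model ignores the clock phases, tracking only which
operand pair emerges on which cycle.

```python
from collections import deque

LATENCY = 5  # clock cycles from operand input to product output


def run_pipeline(vector_pairs):
    """Feed one pair per clock cycle; return (cycle, a, b, product)
    tuples in the order the results emerge."""
    stages = deque([None] * LATENCY, maxlen=LATENCY)
    results = []
    # Idle cycles are appended so the last operands can drain out.
    stream = list(vector_pairs) + [None] * LATENCY
    for cycle, pair in enumerate(stream, start=1):
        leaving = stages[0]
        stages.append(pair)  # maxlen drops the oldest entry automatically
        if leaving is not None:
            a, b = leaving
            results.append((cycle, a, b, a * b))
    return results


for cycle, a, b, product in run_pipeline([(743, 27), (-1123, 891)]):
    print(f"cycle {cycle}: {a} x {b} = {product}")
```

With one new pair applied per cycle, one product should appear
per cycle once the first five cycles have elapsed.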

To determine if the latches can serially operate as
designed, known sequences should be applied at the inputs
with the OP pin at a logic 0. Since the latches that are
routed to the four latch output pads are all of different
lengths, the output of this operation will occur at
different times for each pin. For latch 1, latch 2, latch 3
and latch 4, the input sequence will start appearing at the
appropriate output pin after 143, 57, 70 and 64 clock
cycles, respectively.

If any of the test vectors fail, the intermediate latch
results of each vector pair can be shifted to an output pin
for examination. This can provide an excellent aid in
locating circuit faults. The intermediate latch values and
the final product outputs for each of the seven test vector
pairs are found in Appendix C.


D. TESTING FOR MAXIMUM SPEED


The third and final test to be performed on the chips
that pass the logic function testing is to determine the
maximum frequency at which they will operate correctly. To
accomplish this, the duration of the time that phi1 and phi2
are high and the two interphase times when phi1 and phi2 are
low should be separately reduced until an incorrect product
is generated. This should be done with each of the seven
test vectors until a minimum time is found for each of these
four clock parameters. Then the worst case for each of
these parameters over all seven test vectors can be called
the minimum clock parameters for the 16-bit multiplier. The
maximum overall clock frequency for the chip is then just
the reciprocal of the sum of the four minimum clock
parameters.


One of the main advantages of using a silicon compiler
is that it provides an extremely fast transition time from
the initial architecture to the final layout of the design.
This author estimates that the total time to actually
generate the design of the 8-bit multiplier by Carlson
[Ref. 2] using the MacPitts silicon compiler was less than
24 man-hours. Theoretically, at the end of this time, a
functionally correct layout is generated. Later work done
by Froede [Ref. 11] on this compiler has proven that
MacPitts does not always generate a correct layout. In
comparison, the time consumed in the design of the 16-bit
multiplier presented in this thesis is estimated at over 750
man-hours.

This design turn-around time advantage of using a
silicon compiler for chip generation allows the designer a
great degree of freedom to explore possible different
architectures to solve a problem and actually see the
results in silicon. This freedom is not enjoyed by the full
custom designer whose architecture must be thoroughly
researched and optimized prior to the layout of the circuit.
If this is not the case, a tremendous loss of valuable
man-hours occurs when the redesign of a chip's basic
architecture must be undertaken.

The use of a silicon compiler is not without its
disadvantages though. Three of the main areas in which a
silicon compiler generated chip is at a disadvantage are:

     1. density of transistors.
     2. speed of operation.
     3. power consumption per transistor.


To make a specific comparison, an 8-bit multiplier
generated by the MacPitts silicon compiler available at NPS
was compared with the full custom multiplier of this thesis.
The following sections discuss the three main areas listed
above. They are preceded by a discussion of the two
circuit architectures that are to be compared.


A. FUNCTIONAL ARCHITECTURE 


The architecture of the 16-bit multiplier has already
been thoroughly presented in the previous two chapters. In
summary, the chip performs a 16-bit two's complement
pipelined multiplication on 16-bit operands with a latency
of five cycles of a two phase clock. The circuitry for this
chip is designed using a minimum feature size of 3.0 microns
and is wholly contained on one integrated circuit.

The multiplier generated by the MacPitts silicon
compiler performs an 8-bit multiplication on unsigned 8-bit
operands with a latency of eight cycles of a three phase,
five segment clock. It uses the basic add-and-shift
algorithm for the basis of its architecture. Due to the
limitations in chip dimensions, pin count, and minimum
feature size imposed by MOSIS at the time the chip was
fabricated, this chip was designed with a minimum feature
size of 4.0 microns. It requires the cascading of two
identical integrated circuits to perform an 8-bit
multiplication.

Additionally, the 16-bit multiplier employs an LSSD
technique that allows the contents of each of the four
intermediate latches to be serially examined to aid in the
detection of circuit fabrication errors. The MacPitts
multiplier does not employ this technique and determining
fabrication and/or design errors is extremely difficult, if
not impossible, to perform by examining just the chip
outputs. An LSSD technique could possibly have been included
in the MacPitts design, but if included the maximum chip
area defined by MOSIS may have been exceeded.


B. CHIP AREA AND DENSITY


Since the two VLSI circuits are designed with different
minimum feature sizes, to provide a fair basis for comparison
of the two designs the 16-bit multiplier is normalized to a
4.0 micron feature size. Figure 5.1 shows the resultant
log file from the MEXTRA node extractor for both the 8-bit
and 16-bit multipliers. This file contains the chip
dimensions in centimicrons and the number of transistors in
the circuit.


MacPitts 8-bit Multiplier:

     Window: 0 676600 0 602400
     801 depletion
     1612 enhancement
     1398 nodes

Custom 16-bit Multiplier:

     Window: -600 919350 -600 789300
     3914 depletion
     11962 enhancement
     8503 nodes


Figure 5.1 MEXTRA .log Output.


The size shown in Figure 5.1 for the 16-bit multiplier
is based on a lambda of 1.5 microns (a 3.0 micron minimum
feature size). This results in chip dimensions of 9199.5 by
7899.0 microns. By current MOSIS limitations, the maximum
chip dimensions are 9200.0 by 7900.0 microns. Therefore, at
lambda equal to 1.5 microns the overall design is within one
micron or less of the maximum allowed by MOSIS. Normalizing
the circuit dimensions to a 4.0 micron minimum feature size,
the 16-bit multiplier consumes an area 12,266.0 by 10,532.0
microns. By comparison, the MacPitts generated 8-bit
multiplier occupies an area 6766.0 by 6024.0 microns. The
MacPitts chip consumes approximately one-third of the area
of the hand-crafted multiplier.

The other main point of interest that deals with the
physical characteristics of the chip is its transistor
density or number of transistors per square micron. For the
normalized 16-bit multiplier, Figure 5.1 shows a total of
15,876 transistors. This yields a transistor density of
1.23 x 10^-4 transistors per square micron. For the MacPitts
multiplier, the MEXTRA node extraction found a total of
2,413 transistors. This gives a transistor density of 5.92
x 10^-5 transistors per square micron. One interesting point
to note is that the MacPitts compiler found eighty-four more
transistors on the 8-bit multiplier than the MEXTRA node
extractor did [Ref. 2]. One possible explanation for this
difference is that MacPitts generates some unusual
transistor structures that were unrecognizable by MEXTRA.
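
The area and density figures quoted in this section can be
reproduced from the MEXTRA window coordinates and transistor
counts. The sketch below assumes the window values are in
centimicrons and scales the custom chip by 4.0/3.0 to account
for the two minimum feature sizes; the function name summarize
is hypothetical.

```python
def summarize(name, window_centimicrons, depletion, enhancement, scale=1.0):
    """Chip size, transistor count and density from MEXTRA .log figures."""
    x_min, x_max, y_min, y_max = window_centimicrons
    width = (x_max - x_min) / 100.0 * scale    # microns
    height = (y_max - y_min) / 100.0 * scale   # microns
    transistors = depletion + enhancement
    density = transistors / (width * height)   # transistors per square micron
    print(f"{name}: {width:.1f} x {height:.1f} microns, "
          f"{transistors} transistors, {density:.3g} per square micron")
    return width, height, transistors, density


# Custom chip: 3.0 micron feature size, normalized to 4.0 microns.
custom = summarize("custom 16-bit", (-600, 919350, -600, 789300),
                   3914, 11962, scale=4.0 / 3.0)
macpitts = summarize("MacPitts 8-bit", (0, 676600, 0, 602400), 801, 1612)

ratio = (macpitts[0] * macpitts[1]) / (custom[0] * custom[1])
print(f"MacPitts area / custom area = {ratio:.2f}")
```

The density ratio between the two designs is a little over two
to one, even though the custom chip carries more than six times
as many transistors.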


C. POWER CONSUMPTION


One area that is becoming more and more important with
the increasing number of transistors per chip that is being
created by improved technology is the static DC power
dissipation of a VLSI circuit. For the purposes of providing
comparisons, the CAD program POWEST is used as the basis for
reference.

For the 16-bit multiplier, the average DC power
consumption is found to be 1.983 Watts with a maximum power
usage of 3.177 Watts. Using POWEST on the 8-bit multiplier
yielded an average DC power consumption of 0.352 Watts and a
maximum power usage of 0.667 Watts. Appendix B contains the
results of the POWEST runs on both of the designs. The
MacPitts silicon compiler also outputs an estimate it makes
of the maximum power consumed by a circuit. For the 8-bit
multiplier, this value is 0.407 Watts. This value is over
thirty-five percent less than the POWEST maximum value.

One way to possibly compare the power consumption for
the two designs is to determine a power consumed per
transistor figure. Using the maximum POWEST values for both
designs yields 2.00 x 10^-4 Watts per transistor for the
16-bit multiplier and 2.77 x 10^-4 Watts per transistor for
the 8-bit multiplier. The difference between these two
figures can be primarily attributed to the following. The
MacPitts multiplier uses nine two input NAND gates to
generate the full adders used in each stage. The custom
multiplier uses a selector adder composed primarily of pass
transistors which consume no DC static power. This results
in an overall lower power consumption per transistor for the
16-bit multiplier when compared to the 8-bit multiplier.


D. SPEED OF OPERATION


As discussed earlier, CRYSTAL determined that the
maximum clock frequency for the 16-bit multiplier is 1.234
MHz. MacPitts generated designs use a different clocking
scheme than the two phase, non-overlapping clock presented
by Mead and Conway [Ref. 1]. They use a three phase,
five segment overlapping clock to generate the control
signals for each latch in the pipeline. For a full
discussion of the MacPitts clocking scheme and how to use
the CRYSTAL timing analyzer on a MacPitts design, the reader
is referred to work done by Froede [Ref. 11]. The timing
analysis was performed on the MacPitts multiplier in
accordance with this document and the worst-case CRYSTAL
timing results are contained in Appendix B.

The overall minimum clock period for a MacPitts design is
found by adding the worst stage propagation delay that
occurs during the first two segments of the clock to the
last three clock segment delays. For the 8-bit multiplier,
the longest stage is the first. The critical path is found
to run from the input pads, through the Weinberger array,
and then through eight full adders cascaded in series to
perform one summation of the partial products in the
add-and-shift algorithm. This delay was found to be 4833.89
nanoseconds. The sum of the individual times for the clock
signals to travel from the input pads to the latch cells
during the last three segments of the clock is 207.174
nanoseconds. This results in an overall minimum clock period
of 5046.03 nanoseconds and a maximum clock frequency of
198.176 kHz. The high propagation time in the first stage of
the circuit is due primarily to three things. First, high
resistance polysilicon is utilized for the long data runs.
Second, no signals are buffered in any way to provide an
improved signal sourcing capability to help combat the high
fanouts and long data runs. Third, an 8-bit ripple carry
adder is utilized to sum two partial products in every stage
of the pipeline. Each 1-bit full adder in an 8-bit ripple
carry adder is composed of nine NAND gates. The carry in
between each full adder in the ripple carry adder is not
routed directly, but is routed over a long polysilicon wire
which also contributes to the high critical path delay.


E. SUMMARY 


Table III summarizes the results for the comparison of
the hand-crafted design and its silicon compiler generated
counterpart. The results are as expected with the custom
design having a six-fold increase in maximum speed, a
thirty-eight percent decrease in power consumption per
transistor, and a doubling of chip density over the MacPitts
design. The true advantage of the MacPitts silicon compiler
is in its ability to provide extremely rapid design
turn-around time versus a hand-crafted design. As research
continues into the area of silicon compilation and
improvements are made to existing compilers, they may
someday become the powerful and useful tool that they have
the potential to be.


TABLE III 


Summary of Comparison Statistics 


PARAMETER                     CUSTOM MULT       MACPITTS MULT 

SIZE OF OPERAND INPUTS        16 bits           8 bits 

DIMENSIONS                    WZ2266 x 10552    6706 x 6024 
  (microns) 

DENSITY                       NeceZ Soa         SZ 0e = 
  (transistors/micron²) 

STATIC DC POWER (Watts) 
  POWEST AVERAGE              1.983             0.352 
  POWEST MAXIMUM              3.177             0.667 
  MACPITTS MAXIMUM            NA                0.407 

POWER/TRANSISTOR              2o 00S 10s        2. HOOx% 10m 
  (Watts) 

MAXIMUM FREQUENCY             Ws4ds 0           198.176 
  (KHZ) 

DESIGN TIME                   10                24 
  (man-hours) 
VI. CONCLUSION 

In this thesis, the application of carry-save addition 
to a 16-bit two's complement multiplication and its 
implementation as a pipelined VLSI design have been presented. A 
comparison between this hand-crafted design and an 8-bit 
unsigned multiplier generated by the MacPitts silicon compiler 
was developed. This comparison, coupled with the experience 
gained in the actual design and computer simulation of the 
multiplier, leads to the following conclusions and 
recommendations. 


A. DESIGN OF THE MULTIPLIER 


If the design of the multiplier were to be undertaken 
again, three changes to the circuit would be desirable. 
First, the incorporation of a static latch would be 
attempted, provided a feasible design that would fit into the 
limited available chip area could be developed. A static 
latch would ensure that data remains valid and is not 
discharged from the inverter's gate capacitance if too slow 
a clock is applied. Second, the high fanout from the latch 
control drivers would be divided into a tree structure. At 
its termination points would be smaller, more efficient 
drivers that would each drive a fanout not greater than five. 
Third, the buffering of the high fanout sign-extended 
bits of the first stage and of the outputs of certain 
1x1 multipliers would be improved. Both of the last two 
improvements would be directed at optimizing the maximum 
clock frequency of the multiplier. 
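
To give a feel for the size of such a driver tree, the following sketch counts the drivers needed at each level when no driver is allowed a fanout greater than five. It is only a counting exercise under stated assumptions: the load count of 204 is an illustrative figure, not a measured fanout from the design, and the function name is hypothetical.

```python
import math


def buffer_tree(total_loads, max_fanout=5):
    """Return the number of drivers per level, leaf level first.

    Each driver feeds at most `max_fanout` loads; the final entry is the
    single root driver at the top of the tree.
    """
    levels = []
    loads = total_loads
    while loads > max_fanout:
        drivers = math.ceil(loads / max_fanout)
        levels.append(drivers)
        loads = drivers  # these drivers become the loads of the next level up
    levels.append(1)     # root driver feeds whatever remains
    return levels


if __name__ == "__main__":
    # Illustrative only: 204 latch control loads, fanout limited to five.
    print(buffer_tree(204))
```

The trade-off is that each added level inserts another driver delay into the control path, so the tree depth should be weighed against the reduction in capacitive load per driver.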

Another possible solution to the long propagation delay 
through the first stage is to partition the stage into two 
stages with approximately equal delay. Although this would 
reduce the propagation delay through the first stage, the 
increase in routing complexity and the area required for an 
additional 204-bit latch may not be feasible within current 
MOSIS limitations. 

The LSSD technique is highly recommended for 
any pipelined design so that the testing and detection of 
fabrication errors is made easier. Not only will the LSSD 
technique prove beneficial in after-fabrication testing, 
but it also proved extremely useful in CAD simulation before 
fabrication to detect routing errors. In most cases, the value 
of implementing LSSD will far outweigh the increased 
complexity of the latch design, and it avoids the potential 
frustration of searching for errors based only on final latch outputs. 
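
The benefit of observing intermediate latches can be illustrated with a small sketch of a serial scan-out followed by a comparison against expected contents. This is not a model of the actual latch circuit: it assumes the bit nearest the scan output leaves first and that each later bit is written to the left of the earlier ones, and all function names are hypothetical.

```python
def scan_out(latch_bits):
    """Serially shift a latch's contents out through its scan output.

    Assumption: the last list element sits nearest the scan output, and
    each newly shifted bit is placed to the left of those already seen.
    """
    chain = list(latch_bits)
    observed = ""
    while chain:
        bit = chain.pop()            # bit at the scan output leaves first
        observed = str(bit) + observed
    return observed


def bits_to_hex(bitstring):
    """Express a bit string as zero-padded upper-case hexadecimal."""
    digits = (len(bitstring) + 3) // 4
    return format(int(bitstring, 2), "0{}X".format(digits))


def first_mismatch(expected_hex, observed_hex):
    """Return the index of the first differing hex digit, or None if equal."""
    for index, (e, o) in enumerate(zip(expected_hex, observed_hex)):
        if e != o:
            return index
    return None


if __name__ == "__main__":
    observed = bits_to_hex(scan_out([0, 0, 0, 1, 1, 0, 1, 1]))
    print(observed, first_mismatch("1B", observed))
```

Comparing each stage's latch in this way localizes a fault to one pipeline stage, rather than leaving the designer to infer it from the final product alone.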

A 32-bit CLA adder could be developed to complement the 
16-bit multiplier. This can be accomplished very rapidly 
and with little additional effort by using the same method 
described in this thesis, with the following exception. 
Since the carry in to an adder is not necessarily zero, the 
equations actually input to EQNTOTT and TPLA should be 
Equations 2.3 through 2.8 and Equations 2.13 through Zags 
Additionally, the use of full 32-bit operands will require 
the expansion of all of the latches. 
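
The effect of a nonzero carry in on the lookahead equations can be sketched as follows. The code uses the textbook generate and propagate form, in which every carry is written directly in terms of the operand bits and the initial carry; it is an assumption that this matches the general form of the equations referenced above, and the function names are hypothetical.

```python
def lookahead_carries(a, b, carry_in, width):
    """Compute propagate bits and all carries in expanded lookahead form.

    c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]...p[1]g[0] | p[i]...p[0]c[0]
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
    carries = [carry_in & 1]
    for i in range(width):
        # Term carrying the initial carry through every lower position.
        carry = carry_in & 1
        for j in range(i + 1):
            carry &= p[j]
        # Terms for a carry generated at position k and propagated up to i.
        for k in range(i + 1):
            term = g[k]
            for j in range(k + 1, i + 1):
                term &= p[j]
            carry |= term
        carries.append(carry)
    return p, carries


def cla_add(a, b, carry_in=0, width=32):
    """Return (sum, carry_out) for a width-bit carry lookahead addition."""
    p, c = lookahead_carries(a, b, carry_in, width)
    total = 0
    for i in range(width):
        total |= (p[i] ^ c[i]) << i   # sum bit = propagate XOR carry in
    return total, c[width]


if __name__ == "__main__":
    print(cla_add(0xFFFFFFFF, 0, carry_in=1))
```

When the carry in is fixed at zero, the last product term of every carry equation vanishes, which is why a simpler set of equations suffices in that case.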


B. CAD HARDWARE AND SOFTWARE 


The combination of EQNTOTT and TPLA proved to be a very 
useful pair of CAD tools in the development of complex logic 
functions. Additionally, TPLA appears extremely versatile 
with the different technologies available and its numerous 
options. 

CAESAR proved to be a very good design tool for the 
graphical layout of a VLSI design. The installation of its 
successor, the layout editor MAGIC, should greatly ease the 
routing burden of the designer. 

The coming addition of hardware to support actual 
testing of chips that have been fabricated by MOSIS will 
greatly aid in determining the accuracy of available CAD 
simulation tools. Once these in-house testing capabilities 
are available, extensive testing should be accomplished on 
the two multipliers discussed here. In particular, a 
detailed comparison should be made between CAD simulation 
and actual results in the areas of functional operation, 
maximum speed, and static DC power consumption. 


C. SILICON COMPILATION 


Even though the MacPitts program available at NPS by no 
means provides an optimum integrated circuit design, it is 
an excellent vehicle from which to study the area of silicon 
compilers. Silicon compilers provide an excellent alternative to the 
custom, gate array, and standard cell interconnection 
methods that are in use today. Further research into 
optimizing the existing MacPitts silicon compiler for speed, 
power consumption, and transistor density should be 
undertaken. 
APPENDIX A 


STIPPLE PLOTS 


On the following pages are the stipple plots of the four 
basic building blocks that were used in the design of the 
16-bit multiplier. Following these is a stipple plot of the 
final layout for the 16-bit two's complement multiplier that 
was designed for this thesis. For the purpose of clarity 
and continuity, a stipple plot of the 8-bit multiplier 
generated by the MacPitts silicon compiler is also 
presented. All plots were made with the CAD program 
CIFPLOT. 
APPENDIX B 
SIMULATION RESULTS 


The following pages in this appendix contain, in order, 
the resultant ESIM and CRYSTAL sessions for the 16-bit 
multiplier, the CRYSTAL timing analysis for the 8-bit multiplier, 
and the POWEST estimates for both the 16-bit and 8-bit 
multipliers. 
ESIM results for 16-bit two's complement multiplier. 


% esim mult32.sim 
11962 transistors, 8452 nodes (3914 pulled up) 


sim> @ init esim 

initialization took 33772 steps 
initialization took 4682 steps 
initialization took 2%0 steps 
initialization took 0 steps 
initialization took 0 steps 

step took 6 events 
latchout=0000 0 
plow=1111111111111111 65535 
phigh=1111111111111111 65535 
latchin=0000 0 
bin=0000000000000000 0 
ain=0000000000000000 0 
op=1 


sim> R 5 


sim> v 

latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=0000000000000000 0 
ain=0000000000000000 0 
op=1 

h inputs: Vdd op 
l inputs: GND phi1 phi2 


sim> @ test_vector1 

step took 451 events 
latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=0000000010001111 143 
ain=0000000000011011 27 
op=1 

sim> c 

latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=0000000010001111 143 
ain=0000000000011011 27 
op=1 

cycle took 3785 events 


sim> @ test_vector2 

step took 1927 events 
latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=1111111101110001 65393 
ain=0000000000011011 27 
op=1 

sim> c 

latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=1111111101110001 65393 
ain=0000000000011011 27 
op=1 

cycle took 4888 events 


sim> @ test_vector3 

step took 2819 events 
latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=0000000010001111 143 
ain=1111111111100101 65509 
op=1 

sim> c 

latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=0000000010001111 143 
ain=1111111111100101 65509 
op=1 

cycle took 5243 events 


sim> @ test_vector4 

step took 4777 events 
latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=1111111101110001 65393 
ain=1111111111100101 65509 
op=1 

sim> c 

latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=1111111101110001 65393 
ain=1111111111100101 65509 
op=1 

cycle took 4821 events 


sim> @ test_vector5 

step took 3405 events 
latchout=0000 0 
plow=0000000000000000 0 
phigh=0000000000000000 0 
latchin=0000 0 
bin=0000010001100011 1123 
ain=0000001101111011 891 
op=1 

sim> c 

latchout=0000 0 
plow=0000111100010101 3861 
phigh=0000000000000000 0 
latchin=0000 0 
bin=0000010001100011 1123 
ain=0000001101111011 891 
op=1 

cycle took 5981 events 


sim> @ test_vector6 

step took 2121 events 
latchout=0000 0 
plow=0000111100010101 3861 
phigh=0000000000000000 0 
latchin=0000 0 
bin=0000001101111011 891 
ain=1111101110011101 64413 
op=1 

sim> c 

latchout=0000 0 
plow=1111000011101011 61675 
phigh=1111111111111111 65535 
latchin=0000 0 
bin=0000001101111011 891 
ain=1111101110011101 64413 
op=1 

cycle took 5341 events 


sim> @ test_vector7 

step took 1708 events 
latchout=0000 0 
plow=1111000011101011 61675 
phigh=1111111111111111 65535 
latchin=0000 0 
bin=1000000000000000 32768 
ain=1000000000000000 32768 
op=1 

sim> c 

latchout=0000 0 
plow=1111000011101011 61675 
phigh=1111111111111111 65535 
latchin=0000 0 
bin=1000000000000000 32768 
ain=1000000000000000 32768 
op=1 

cycle took 5084 events 


sim> c 

latchout=0000 0 
plow=0000111100010101 3861 
phigh=0000000000000000 0 
latchin=0000 0 
bin=1000000000000000 32768 
ain=1000000000000000 32768 
op=1 

cycle took 4786 events 


sim> c 

latchout=0000 0 
plow=0100010010010001 17553 
phigh=0000000000001111 15 
latchin=0000 0 
bin=1000000000000000 32768 
ain=1000000000000000 32768 
op=1 

cycle took 4170 events 


sim> c 

latchout=0000 0 
plow=1011101101101111 47983 
phigh=1111111111110000 65520 
latchin=0000 0 
bin=1000000000000000 32768 
ain=1000000000000000 32768 
op=1 

cycle took 4280 events 


sim> c 

latchout=0000 0 
plow=0000000000000000 0 
phigh=0100000000000000 16384 
latchin=0000 0 
bin=1000000000000000 32768 
ain=1000000000000000 32768 
op=1 

cycle took 3953 events 


sim> q 
CRYSTAL results for 16-bit two's complement multiplier. 


Crystal, v.2 
: build mult32.sim 
[1:12.1u 0:12.4s 1786k] 
: inputs a<15:0> b<15:0> op phi1 phi2 
[0:00.1u 0:00.1s 1795k] 
: inputs 11 in 12 in 13 in 14 in 
[0:00.0u 0:00.0s 1795k] 
: outputs p<31:0> 11 out 12 out 13 out 14 out 
[0:00.0u 0:00.0s 1795k] 
: markdynamic phi1 0 phi2 0 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:08.1u 0:01.1s 1795k] 


*** RISETIME FOR PHI2 IN NORMAL OP *** 
: set 1 op 
[0:00.5u 0:00.1s 1795k] 
: set 0 phi1 
[0:00.7u 0:00.1s 1795k] 
: delay phi2 0 -1 
(12279 stages examined.) 
[0:46.8u 0:04.6s 1855k] 
: critical 1m 
Node 14171 is driven high at 98.26ns 
...through fet at (2772, 1751) to Vdd after 
16259 is driven low at 95.79ns 
...through fet at (2792, 1810) to GND after 
16968 is driven high at 92.08ns 
...through fet at (2800, 1819) to 17829 
...through fet at (2794, 1823) to Vdd after 
1273 is driven high at 89.36ns 
...through fet at (313, 1486) to Vdd after 
11735 is driven high at 35.90ns 
...through fet at (303, 1506) to Vdd after 
11765 is driven high at 14.17ns 
...through fet at (287, 1506) to Vdd after 
11745 is driven low at 10.03ns 
...through fet at (285, 1422) to GND after 
11764 is driven high at 5.79ns 
...through fet at (160, 1582) to Vdd after 
12847 is driven low at 0.11ns 
...through fet at (156, 1604) to GND after 
phi2 is driven high at 0.00ns 
[0:00.3u 0:00.1s 1855k] 


*** FALLTIME FOR PHI2 IN NORMAL OP *** 
: clear 
[0:00.9u 0:00.3s 1855k] 
: set 1 op 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:06.4u 0:00.6s 1855k] 
: set 0 phi1 
[0:00.8u 0:00.1s 1855k] 
: delay phi2 -1 0 
(16400 stages examined.) 
[0:58.8u 0:02.6s 1879k] 
: critical 1m 
Node 11983 is driven low at 64.28ns 
...through fet at (2836, 1550) to GND after 
12776 is driven high at 64.98ns 
...through fet at (2842, 1602) to 13219 
...through fet at (2852, 1602) to 13220 
...through fet at (2863, 1645) to Vdd after 
12892 is driven high at 54.01ns 
...through fet at (2840, 1645) to Vdd after 
13081 is driven low at 53.08ns 
...through fet at (2836, 1656) to GND after 
14010 is driven high at 55.67ns 
...through fet at (2756, 1696) to 14572 
...through fet at (2772, 1696) to 14437 
...through fet at (2782, 1751) to Vdd after 
14171 is driven low at 35.54ns 
...through fet at (2767, 1756) to GND after 
16259 is driven high at 33.63ns 
...through fet at (2794, 1800) to Vdd after 
16968 is driven low at 22.80ns 
...through fet at (2800, 1819) to 17829 
...through fet at (2792, 1816) to GND after 
1273 is driven high at 21.54ns 
...through fet at (313, 1486) to Vdd after 
11735 is driven high at 13.39ns 
...through fet at (293, 1506) to Vdd after 
11765 is driven low at 10.69ns 
...through fet at (285, 1483) to GND after 
11745 is driven high at 7.19ns 
...through fet at (287, 1410) to Vdd after 
11764 is driven low at 2.51ns 
...through fet at (156, 1581) to GND after 
12847 is driven high at 0.56ns 
...through fet at (163, 1604) to Vdd after 
phi2 is driven low at 0.00ns 
[0:00.3u 0:00.1s 1879k] 


*** PHI1 RISETIME IN NORMAL OP *** 
: clear 
[0:00.9u 0:00.3s 1879k] 
: set 1 op 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:06.5u 0:00.5s 1879k] 
: set 0 phi2 
[0:00.2u 0:00.0s 1879k] 
: delay phi1 0 -1 
(5926 stages examined.) 
[0:12.1u 0:00.6s 1879k] 
: critical 1m 
Node 17518 is driven high at 108.62ns 
...through fet at (2256, 1845) to 18827 after 
normdrout is driven high at 101.60ns 
...through fet at (4013, 1343) to Vdd after 
10876 is driven high at 48.60ns 
...through fet at (4141, 1351) to Vdd after 
11000 is driven high at 26.81ns 
...through fet at (4163, 1351) to Vdd after 
11302 is driven low at 22.55ns 
...through fet at (4166, 1423) to GND after 
11063 is driven high at 17.47ns 
...through fet at (4408, 1354) to Vdd after 
11064 is driven low at 6.61ns 
...through fet at (4433, 1362) to 11369 
...through fet at (4433, 1366) to GND after 
10622 is driven high at 5.72ns 
...through fet at (4483, 1305) to Vdd after 
10603 is driven low at 0.11ns 
...through fet at (4498, 1281) to GND after 
phi1 is driven high at 0.00ns 
[0:00.1u 0:00.1s 1879k] 


*** PHI1 FALLTIME FOR NORMAL OP *** 
: clear 
[0:00.8u 0:00.3s 1879k] 
: set 1 op 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:06.2u 0:00.1s 1879k] 
: set 0 phi2 
[0:00.2u 0:00.0s 1879k] 
: delay phi1 -1 0 
(4092 stages examined.) 
[0:10.4u 0:00.6s 1896k] 
: critical 1m 
Node 4675 is driven low at 89.11ns 
...through fet at (2091, 781) to GND after 
4486 is driven high at 83.87ns 
...through fet at (2736, 842) to Vdd after 
normdrout is driven low at 39.96ns 
...through fet at (4021, 1351) to GND after 
11059 is driven high at 32.15ns 
...through fet at (4141, 1446) to Vdd after 
11302 is driven high at 10.54ns 
...through fet at (4163, 1446) to Vdd after 
11063 is driven low at 5.91ns 
...through fet at (4407, 1362) to GND after 
11064 is driven high at 2.13ns 
...through fet at (4434, 1354) to Vdd after 
10622 is driven low at 2.49ns 
...through fet at (4498, 1304) to GND after 
10603 is driven high at 0.56ns 
...through fet at (4489, 1282) to Vdd after 
phi1 is driven low at 0.00ns 
[0:00.2u 0:00.1s 1896k] 


*** PHI1 RISETIME FOR A SHIFT OP *** 
: clear 
[0:00.9u 0:00.3s 1896k] 
: set 0 op 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:06.6u 0:00.5s 1896k] 
: set 0 phi2 
[0:00.2u 0:00.0s 1896k] 
: delay phi1 0 -1 
(11989 stages examined.) 
[0:42.1u 0:01.7s 1918k] 
: critical 1m 
Node 4354 is driven high at 343.02ns 
...through fet at (2743, 502) to 3227 
...through fet at (2734, 463) to Vdd after 
shdrout is driven high at 45.78ns 
...through fet at (4007, 1223) to Vdd after 
10522 is driven low at 29.07ns 
...through fet at (4053, 1228) to GND after 
10336 is driven high at 27.68ns 
...through fet at (1067, 1216) to Vdd after 
10523 is driven low at 24.23ns 
...through fet at (4070, 1266) to GND after 
10589 is driven high at 19.49ns 
...through fet at (4407, 1334) to Vdd after 
10633 is driven low at 6.61ns 
...through fet at (4433, 1327) to 10631 
...through fet at (4433, 1324) to GND after 
10622 is driven high at 5.72ns 
...through fet at (4483, 1305) to Vdd after 
10603 is driven low at 0.11ns 
...through fet at (4498, 1281) to GND after 
phi1 is driven high at 0.00ns 
[0:00.1u 0:00.1s 1918k] 


*** PHI1 FALLTIME FOR A SHIFT OP *** 
: clear 
[0:00.8u 0:00.3s 1918k] 
: set 0 op 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:06.4u 0:00.4s 1918k] 
: set 0 phi2 
[0:00.2u 0:00.0s 1918k] 
: delay phi1 -1 0 
(20633 stages examined.) 
[1:22.2u 0:08.4s 1961k] 
: critical 1m 
Node 11983 is driven low at 72.04ns 
...through fet at (2836, 1550) to GND after 
12776 is driven high at 70.74ns 
...through fet at (2842, 1602) to 13219 
...through fet at (2852, 1602) to 13220 
...through fet at (2863, 1645) to Vdd after 
12892 is driven high at 61.78ns 
...through fet at (2840, 1645) to Vdd after 
13081 is driven low at 60.84ns 
...through fet at (2836, 1656) to GND after 
14010 is driven high at 63.43ns 
...through fet at (2756, 1696) to 14572 
...through fet at (2772, 1696) to 14437 
...through fet at (2782, 1751) to Vdd after 
14171 is driven low at 43.30ns 
...through fet at (2767, 1756) to GND after 
16259 is driven high at 41.39ns 
...through fet at (2794, 1800) to Vdd after 
shdrout is driven low at 30.70ns 
...through fet at (4032, 1225) to GND after 
10522 is driven high at 19.22ns 
...through fet at (4045, 1289) to Vdd after 
10523 is driven high at 10.01ns 
...through fet at (4067, 1289) to Vdd after 
10589 is driven low at 6.29ns 
...through fet at (4406, 1327) to GND after 
10633 is driven high at 3.13ns 
...through fet at (4434, 1334) to Vdd after 
10622 is driven low at 2.49ns 
...through fet at (4498, 1304) to GND after 
10603 is driven high at 0.56ns 
...through fet at (4489, 1282) to Vdd after 
phi1 is driven low at 0.00ns 
[0:00.2u 0:00.2s 1961k] 


*** INPUT PAD TO LATCH DELAY *** 
: clear 
[0:00.9u 0:00.7s 1961k] 
: set 1 op 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:06.3u 0:00.3s 1961k] 
: set 0 phi1 phi2 
[0:00.9u 0:00.1s 1961k] 
: delay a<15:0> 0 0 
(43921 stages examined.) 
[2:16.6u 0:09.2s 1961k] 
: critical 1m 
Node 19554 is driven high at 558.82ns 
...through fet at (1008, 2140) to Vdd after 
19655 is driven low at 554.42ns 
...through fet at (980, 2145) to GND after 
21705 is driven high at 531.79ns 
...through fet at (667, 2760) to 27839 
...through fet at (677, 2760) to 27714 
...through fet at (693, 2798) to Vdd after 
27436 is driven low at 485.22ns 
...through fet at (698, 2808) to GND after 
22366 is driven high at 473.40ns 
...through fet at (1823, 3125) to Vdd after 
30352 is driven low at 337.44ns 
...through fet at (183. 3142) to GND after 
30351 is driven high at 332.90ns 
...through fet at (1807, 3257) to 33567 
...through fet at (1817, 3257) to a3000 
...through fet at (1840, 3306) to Vdd after 
33186 is driven low at 299.22ns 
...through fet at (1818, 3293) to GND after 
33391 is driven high at 298.15ns 
...through fet at (1822, 3306) to Vdd after 
30591 is driven low at 295.49ns 
...through fet at (1955, 3577) to 38872 
...through fet at (1955, 3580) to GND after 
38615 is driven high at 241.93ns 
...through fet at (1997, 3813) to Vdd after 
40527 is driven low at 3.37ns 
...through fet at (2011, 3839) to GND after 
40457 is driven high at 2.61ns 
...through fet at (2030, 3824) to Vdd after 
40625 is driven low at 0.11ns 
...through fet at (2052, 3839) to GND after 
a2 is driven high at 0.00ns 
[0:00.2u 0:00.2s 1961k] 


: q 
[8:58.2u 0:49.0s 1961k] Crystal done. 
CRYSTAL results for stage 1 for the MacPitts chip. 


: build stage1.sim 
[0:12.4u 0:01.3s 247k] 
: inputs in<27:1> 
[0:00.0u 0:00.1s 256k] 
: outputs a<24:1> 


*** FIRST STAGE DELAY *** 
: delay in<27:1> 0 0 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
(11559 stages examined.) 
[0:22.7u 0:00.9s 411k] 
: critical 
Node 2195 is driven high at 4838.89ns 
...through fet at (565, 934) to Vdd after 
2118 is driven low at 4831.44ns 
...through fet at (506, 926) to 2127 
...through fet at (506, 921) to GND after 
2095 is driven high at 4825.41ns 
...through fet at (485, 928) to Vdd after 
1867 is driven low at 4813.82ns 
...through fet at (423, 922) to 2086 
...through fet at (423, 917) to GND after 
1805 is driven high at 4783.75ns 
..through fet at (669, 910) to a2 
..through fet at (683, 910) to 1944 
...through fet at (620, 934) to Vdd after 
2119 is driven low at 4330.98ns 
...through fet at (585, 924) to 2103 
...through fet at (585, 919) to GND after 
2048 is driven high at 4326.95ns 
...through fet at (537, 930) to Vdd after 
1933 is driven low at 4314.44ns 
...through fet at (645, 1000) to 2790 
...through fet at (645, 1005) to GND after 
2730 is driven high at 4306.41ns 
...through fet at (537, 1010) to Vdd after 
2798 is driven low at 4293.82ns 
...through fet at (506, 1006) to 2807 
...through fet at (506, 1001) to GND after 
2775 is driven high at 4287.69ns 
...through fet at (485, 1008) to Vdd after 
2551 is driven low at 4275.76ns 
...through fet at (423, 1002) to 2766 
...through fet at (423, 997) to GND after 
2525 is driven high at 4243.64ns 

.. through fet at (669, 990) to a3 

...through fet at (683, 990) to 2637 

... through fet at (620, 1014) to Vdd after 
2799 is driven low at 3741.79ns 

...through fet at (585, 1004) to 2783 

.. through fet at (585, 999) to GND after 
2624 is driven high at 3735.64ns 

...through fet at (652, 1074) to Vdd after 
3236 is driven low at 3712.11ns 

...through fet at (423, 1082) to 3449 

...through fet at (423, 1077) to GND after 
3210 is driven high at 3680.28ns 

...through fet at (669, 1070) to a4 

...through fet at (683, 1070) to 3318 

...through fet at (620, 1094) to Vdd after 
3482 is driven low at 3186.59ns 

...through fet at (585, 1084) to 3466 

...through fet at (585, 1079) to GND after 
3411 is driven high at 3182.56ns 

...through fet at (537, 1090) to Vdd after 
3307 is driven low at 3170.04ns 

...through fet at (645, 1160) to 4149 

...through fet at (645, 1165) to GND after 
4087 is driven high at 3162.01ns 

...through fet at (537, 1170) to Vdd after 
4157 is driven low at 3149.43ns 

...through fet at (506, 1166) to 4166 

...through fet at (506, 1161) to GND after 
4133 is driven high at 3143.25ns 

...through fet at (485, 1168) to Vdd after 
3907 is driven low at 3131.21ns 

...through fet at (423, 1162) to 4124 

...through fet at (423, 1157) to GND after 
3881 is driven high at 3098.30ns 

...through fet at (669, 1150) to a5 

...through fet at (683, 1150) to 3990 

...through fet at (620, 1174) to Vdd after 
4158 is driven low at 2577.22ns 

...through fet at (585, 1164) to 4141 

...through fet at (585, 1159) to GND after 
3978 is driven high at 2571.91ns 

...through fet at (652, 1234) to Vdd after 
4770 is driven low at 2555.05ns 

...through fet at (530, 1244) to 4825 

...through fet at (530, 1239) to GND after 
4841 is driven high at 2547.85ns 

...through fet at (513, 1252) to Vdd after 
4818 is driven low at 2532.70ns 
..through fet at (478, 1242) to 4810 

.. through fet at (478, 1237) to GND after 
4568 is driven high at 2501.33ns 

...through fet at (669, 1230) to a6 

...through fet at (683, 1230) to 4677 

...through fet at (620, 1254) to Vdd after 
4842 is driven low at 1985.61ns 

...through fet at (585, 1244) to 4826 

...through fet at (585, 1239) to GND after 
4666 is driven high at 1980.29ns 

...through fet at (652, 1314) to Vdd after 
5456 is driven low at 1963.43ns 

...through fet at (530, 1324) to 5508 

...through fet at (530, 1319) to GND after 
5526 is driven high at 1956.23ns 

...through fet at (513, 1332) to Vdd after 
5501 is driven low at 1941.04ns 

...through fet at (478, 1322) to 5493 

...through fet at (478, 1317) to GND after 
5248 is driven high at 1909.46ns 

...through fet at (669, 1310) to a7 

...through fet at (683, 1310) to 5363 

...through fet at (620, 1334) to Vdd after 
5527 is driven low at 1388.69ns 

..through fet at (585, 1324) to 5509 

... through fet at (585, 1319) to GND after 
5346 is driven high at 1383.38ns 

...through fet at (652, 1394) to Vdd after 
6129 is driven low at 1366.51ns 

...through fet at (530, 1404) to 6181 

...through fet at (530, 1399) to GND after 
6197 is driven high at 1359.33ns 

...through fet at (513, 1412) to Vdd after 
6174 is driven low at 1344.20ns 

..through fet at (478, 1402) to 6166 

..through fet at (478, 1397) to GND after 
5928 is driven high at 1312.98ns 

...through fet at (669, 1390) to a8 

...through fet at (683, 1390) to 6036 

.. through fet at (620, 1414) to Vdd after 
6198 is driven low at 800.61ns 

... through fet at (585, 1404) to 6182 

... through fet at (585, 1399) to GND after 
6025 is driven high at 794.45ns 

...through fet at (652, 1474) to Vdd after 
6637 is driven low at 770.92ns 

...through fet at (423, 1482) to 6842 

...through fet at (423, 1477) to GND after 
6611 is driven high at 739.09ns 
...through fet at (669, 1470) to 6644 
...through fet at (683, 1470) to 6720 
...through fet at (620, 1494) to Vdd after 
755 is driven high at 219.87ns 
...through fet at (634, 410) to Vdd after 
1080 is driven low at 134.69ns 
...through fet at (2443, 2876) to GND after 
7571 is driven high at 10.74ns 
...through fet at (2487, 2858) to Vdd after 
inl6 is driven low at 0.00ns 
[0:00.7u 0:00.4s 411k] 


: q 
CRYSTAL results for the clock inputs to 
the registers of the MacPitts chip. 


Crystal, v.2 
: build timing.sim 
[0:13.9u 0:01.6s 258k] 
: inputs phia phib phic 
[0:00.0u 0:00.0s 267k] 


*** PHASE 1 OF 5 *** 
: set 1 phia phic 
[0:00.1u 0:00.0s 267k] 
: delay phib 0 -1 
(604 stages examined.) 
[0:00.9u 0:00.1s 271k] 
: critical 
Node 6392 is driven low at 87.36ns 
...through fet at (2322, 1476) to 6678 
...through fet at (2314, 1472) to GND after 
6391 is driven high at 81.45ns 
...through fet at (2290, 1485) to 6679 
...through fet at (2333, 1483) to Vdd after 
588 is driven high at 65.23ns 
...through fet at (2316, 841) to Vdd after 
490 is driven low at 62.98ns 
...through fet at (2314, 834) to GND after 
28 is driven high at 50.57ns 
...through fet at (791, 149) to Vdd after 
21 is driven low at 0.80ns 
...through fet at (817, 134) to GND after 
phib is driven high at 0.00ns 
[0:00.1u 0:00.1s 271k] 


*** PHASE 2 OF 5 *** 
: clear 
[0:00.1u 0:00.0s 271k] 
: set 1 phia 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:00.6u 0:00.0s 271k] 
: delay phib -1 0 
(28 stages examined.) 
[0:00.1u 0:00.0s 271k] 
: delay phic -1 0 
(28 stages examined.) 
[0:00.1u 0:00.0s 271k] 
: critical 
Node 590 is driven low at 119.19ns 
...through fet at (2344, 833) to GND after 
491 is driven high at 113.28ns 
...through fet at (2338, 813) to Vdd after 
25 is driven low at 84.73ns 
...through fet at (651, 134) to GND after 
19 is driven high at 10.74ns 
...through fet at (695, 148) to Vdd after 
phic is driven low at 0.00ns 
[0:00.1u 0:00.0s 271k] 


*** PHASE 3 OF 5 *** 
: clear 
[0:00.1u 0:00.0s 271k] 
: set 0 phib phic 
[0:00.1u 0:00.0s 271k] 
: delay phia -1 0 
(40 stages examined.) 
[0:00.1u 0:00.0s 272k] 
: critical 
Node 574 is driven high at 61.22ns 
...through fet at (2087, 841) to Vdd after 
483 is driven low at 59.11ns 
...through fet at (2085, 834) to GND after 
353 is driven high at 49.97ns 
...through fet at (2088, 802) to Vdd after 
31 is driven low at 30.89ns 
...through fet at (907, 134) to GND after 
23 is driven high at 10.74ns 
...through fet at (951, 148) to Vdd after 
phia is driven low at 0.00ns 
[0:00.1u 0:00.1s 272k] 


*** PHASE 4 OF 5 *** 
: clear 
[0:00.1u 0:00.0s 272k] 
: set 0 phib phic 
[0:00.1u 0:00.0s 272k] 
: delay phia 0 -1 
(40 stages examined.) 
[0:00.1u 0:00.0s 274k] 
: critical 
Node 574 is driven low at 54.31ns 
...through fet at (2095, 833) to GND after 
483 is driven high at 49.17ns 
...through fet at (2089, 813) to Vdd after 
353 is driven low at 27.72ns 
...through fet at (2082, 792) to GND after 
31 is driven high at 15.16ns 
...through fet at (919, 149) to Vdd after 
23 is driven low at 0.80ns 
...through fet at (945, 134) to GND after 
phia is driven high at 0.00ns 
[0:00.1u 0:00.0s 274k] 


*** PHASE 5 OF 5 *** 
: clear 
[0:00.1u 0:00.0s 274k] 
: set 1 phia 
Marking transistor flow... 
Setting Vdd to 1... 
Setting GND to 0... 
[0:00.6u 0:00.1s 274k] 
: set 0 phib 
[0:00.1u 0:00.0s 274k] 
: delay phic 0 -1 
(412 stages examined.) 
[0:00.5u 0:00.0s 281k] 
: critical 
Node 6674 is driven low at 91.61ns 
...through fet at (2136, 1472) to GND after 
6384 is driven high at 85.13ns 
...through fet at (2116, 1476) to 6673 
...through fet at (2099, 1483) to Vdd after 
578 is driven high at 70.69ns 
...through fet at (2130, 841) to Vdd after 
485 is driven low at 68.51ns 
...through fet at (2128, 834) to GND after 
25 is driven high at 55.79ns 
...through fet at (663, 149) to Vdd after 
19 is driven low at 0.80ns 
...through fet at (689, 134) to GND after 
phic is driven high at 0.00ns 
[0:00.1u 0:00.0s 281k] 


: q 
POWEST Results for the 16-bit Multiplier. 


% powest -p < mult32.sim 


gamma=0.4V**.5, tox=9e-08m, u0=0.08m**2/V-s 
vdd=5V, vtd=-3.5V, vte=0.8V, vsb=2V 


#devs   Pdc_avg (W)   Pdc_max (W)   type 

   0    0.000000      0.000000      enhancement pullups 
3720    1.790881      2.793532      depletion pullups 
 194    0.191948      0.383896      special depletion pullups 
3914    1.982829      3.177428      TOTAL 


POWEST Results for the 8-bit Multiplier. 


% powest -p < multip8c4.sim 


gamma=0.4V**.5, tox=9e-08m, u0=0.08m**2/V-s 
vdd=5V, vtd=-3.5V, vte=0.8V, vsb=2V 


#devs   Pdc_avg (W)   Pdc_max (W)   type 

   0    0.000000      0.000000      enhancement pullups 
 690    0.140672      0.244640      depletion pullups 
 111    0.211404      0.422809      special depletion pullups 
 801    0.352076      0.667449      TOTAL 
APPENDIX C 


TEST VECTORS 


This appendix contains the inputs, intermediate latch 
values, and the final product output for each of the test 
vector pairs described in Chapter 3. Each binary value is 
represented as its hexadecimal equivalent. The inputs and 
outputs are represented with their most significant 
hexadecimal digit in the leftmost position. The intermediate 
latch contents are represented in hexadecimal with the Nth 
bit shifted out of the latch placed to the left of the 
previous bit serially shifted out. The latch at the end of 
Stage X is identified as latchX, where X goes from 1 to 4. 
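
The expected product for any test vector pair can be reproduced independently of the simulator. The sketch below treats each 16-bit input as a two's complement value and prints the 32-bit product as eight hexadecimal digits; the operand pairs are the ones that appear in the ESIM session of Appendix B, and the function names are hypothetical.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def product_hex(a, b):
    """Return the 32-bit two's complement product as 8 hex digits."""
    product = to_signed16(a) * to_signed16(b)
    return format(product & 0xFFFFFFFF, "08X")


if __name__ == "__main__":
    # Operand pairs taken from the ESIM session in Appendix B.
    vectors = [
        (0x001B, 0x008F),
        (0x001B, 0xFF71),
        (0xFFE5, 0x008F),
        (0xFFE5, 0xFF71),
        (0x037B, 0x0463),
        (0xFB9D, 0x037B),
        (0x8000, 0x8000),
    ]
    for a, b in vectors:
        print(format(a, "04X"), format(b, "04X"), product_hex(a, b))
```

The last pair is the corner case of the most negative operand multiplied by itself, which yields the largest positive product the multiplier must represent.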


TEST_VECTOR 1 


INPUTS: 001B 008F 
OUTPUT: 00000F15 
LATCH1: 000000000000 000000000000000001107257 
LATCH2: 000 0000000ZA6E7 
LATCH3: 000000000000 153DC7 
LATCH4: QOQDQ0000000CASDC?7 


TEST_VECTOR 2 


INPUTS: Piet Ohmi es! 
OUTPUT: FFFFF0EB 
LATCH1: 6596 59659659659659659659768C0Ron72 2 
LATCH2: 155555555 AEE095 
LATCH3: 2ZAIAAGOAIAA15D70D15 
LATCH4: AA552595457B8D15 


TEST_VECTOR 3 


INPUTS: 008F FFE5 
OUTPUT: FFFFF0EB 
LATCH1: 4104 1041041040 10406016C90B6062250A53 
LATCH2: 1 ):3)5) 5/5) 5) sya po peel eh oys 
LATCH3: ZAIAAGAIAAZA580CI93 
LATCH4: AAS 52A954ACSC0C93 


TEST_VECTOR 4 


INPUTS: PS 3, ages 
OUTPUT: 00000F15 
LATCH1: QF3CF3CF3CF/CA38E76 &€BEC85B4Y9EO 1EBY29 
LATCH2: OAAB34D556ZzA829 
LATCH3: 15559699A AAC 354049 
LATCH4: 55ACE9B55 EQAA0N4KY 


TEST_VECTOR 5 


INPUTS: 0463 037B 
OUTPUT: 000F4491 
LATCH1: 0000 000000000000 0020800884A01F382641F 
LATCH2: OOO000074A 1C641F 
LATCH3: 0000000 1450383083F 
LATCH4: OOOO0014A 0E1833F 

TEST_VECTOR 6 


INPUTS: 037B FB9D 
OUTPUT: FFF0BB6F 
LATCH1: 47041041040045906 5324551459A2B30F169B 
LATCH2: lp5 5207560 995298 
LATCH3: 2ZA9A918FAB130A4513B 
LATCH4: AA5491F56 4C5251B 

TEST_VECTOR 7 


INPUTS: 8000 8000 
OUTPUT: 40000000 
LATCH1: 0004 104104104104 104 104104 12000000000 
LATCH2: 015555555 &C0000 
LATCH3: O29 AAGAYAAENNN0N NN 
LATCH4: OAD56AB55CC00000 
